Detecting and Evaluating Medical Hallucinations in Large Vision Language Models

Overview

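Large Vision Language Models (LVLMs) are increasingly integral to healthcare applications, including medical visual question answering and imaging report generation. While these models inherit the robust capabilities of foundational Large Language Models (LLMs), they also inherit their susceptibility to hallucinations, a significant concern in high-stakes medical contexts where the margin for error is minimal. However, there are currently no dedicated methods or benchmarks for hallucination detection and evaluation in the medical field. To bridge this gap, we introduce Med-HallMark, the first benchmark specifically designed for hallucination detection and evaluation within the medical multimodal domain. This benchmark provides multi-task hallucination support, multifaceted hallucination data, and hierarchical hallucination categorization. Furthermore, we propose MediHall Score, a new medical evaluation metric that assesses LVLMs' hallucinations through a hierarchical scoring system that considers the severity and type of hallucination, enabling a granular assessment of potential clinical impact. We also present MediHallDetector, a novel medical LVLM engineered for precise hallucination detection and trained with a multitask objective. Through extensive experimental evaluations, we establish baselines for popular LVLMs on our benchmark. The findings indicate that MediHall Score provides a more nuanced understanding of hallucination impact than traditional metrics, and they demonstrate the enhanced performance of MediHallDetector. We hope this work can significantly improve the reliability of LVLMs in medical applications. All resources of this work will be released soon.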
Problem Addressed

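No dedicated method or benchmark exists for detecting and evaluating hallucinations in medical LVLMs. The paper addresses this gap with Med-HallMark, a benchmark for hallucination detection and evaluation in medical multimodal language models.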
Key Ideas

The paper makes three contributions. First, it introduces Med-HallMark, the first benchmark specifically designed for hallucination detection and evaluation within the medical multimodal domain. Second, it proposes MediHall Score, a new medical evaluation metric that assesses LVLMs' hallucinations through a hierarchical scoring system that considers the severity and type of hallucination. Third, it presents MediHallDetector, a novel medical LVLM engineered for precise hallucination detection and trained with a multitask objective.
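To make the idea of a hierarchical, severity-aware score concrete, the following is a minimal Python sketch. It is not the paper's implementation: this summary does not give the actual hallucination categories or their weights, so the level names (`severe`, `moderate`, `minor`, `none`), the numeric scores, and the function names `response_score` and `benchmark_score` are all hypothetical. The sketch only illustrates the general pattern of mapping each judged sentence to a severity-dependent score and then averaging at the response and benchmark levels.

```python
from statistics import mean
from typing import Dict, Sequence

# Hypothetical severity levels and scores (NOT taken from the paper).
# Higher is better: 1.0 means no hallucination, 0.0 means the most
# clinically harmful kind of hallucination.
SEVERITY_SCORES: Dict[str, float] = {
    "severe": 0.0,
    "moderate": 0.5,
    "minor": 0.8,
    "none": 1.0,
}


def response_score(sentence_labels: Sequence[str]) -> float:
    """Average the severity scores of the sentences in one model response."""
    if not sentence_labels:
        raise ValueError("a response must contain at least one labeled sentence")
    return mean(SEVERITY_SCORES[label] for label in sentence_labels)


def benchmark_score(responses: Sequence[Sequence[str]]) -> float:
    """Average the response-level scores over a whole evaluation set."""
    if not responses:
        raise ValueError("the evaluation set must contain at least one response")
    return mean(response_score(labels) for labels in responses)


if __name__ == "__main__":
    # Two hypothetical responses, each labeled sentence by sentence.
    labeled = [
        ["none", "none"],    # fully faithful report
        ["severe", "none"],  # one clinically harmful sentence
    ]
    print(benchmark_score(labeled))
```

Under this kind of weighting, a single severe hallucination lowers the score far more than a minor one, which a flat accuracy or overlap metric would not distinguish. In practice the per-sentence labels would come from a detector or from human annotators.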
Other Highlights

The benchmark provides multi-task hallucination support, multifaceted hallucination data, and hierarchical hallucination categorization. The paper establishes baselines for popular LVLMs on the benchmark and demonstrates the enhanced performance of MediHallDetector. The authors hope the work will significantly improve the reliability of LVLMs in medical applications. All resources of this work will be released soon.
Related Work

According to the paper, there are currently no dedicated methods or benchmarks for hallucination detection and evaluation in the medical field. No related work in this specific area is listed.