MedExpQA: Multilingual Benchmarking of Large Language Models for Medical Question Answering

2024-04-08
  • Abstract
    Large Language Models (LLMs) have the potential to facilitate the development of Artificial Intelligence technology that assists medical experts in interactive decision support, as demonstrated by their competitive performance in Medical Question Answering. However, despite these impressive results, the quality standards required for medical applications are still far from being met. LLMs continue to face the challenges of outdated knowledge and a tendency to generate hallucinated content. Furthermore, most benchmarks for evaluating medical knowledge lack reference gold explanations, which means the reasoning behind LLM predictions cannot be evaluated. Finally, the situation is especially dire for benchmarking LLMs in languages other than English, which, to the best of our knowledge, has been a completely neglected topic. To address these shortcomings, this paper presents MedExpQA, the first multilingual benchmark based on medical exams for evaluating LLMs in Medical Question Answering. To our knowledge, MedExpQA is the first benchmark to include reference gold explanations written by medical doctors, which can be leveraged to establish various gold-based upper bounds against which LLM performance can be compared. Comprehensive multilingual experiments using the gold reference explanations and Retrieval Augmented Generation (RAG) approaches show that LLM performance still has large room for improvement, especially for languages other than English. Moreover, even with state-of-the-art RAG methods, the results demonstrate how difficult it is to obtain and integrate readily available medical knowledge in a way that positively impacts downstream evaluation results for Medical Question Answering. So far the benchmark is available in four languages, but the authors hope this work encourages further development for other languages. (A minimal, hypothetical sketch of this evaluation setup is given after this list.)
  • Problem addressed
    LLMs applied to Medical Question Answering still suffer from outdated knowledge and hallucinated content; existing medical benchmarks lack reference gold explanations, so the reasoning behind LLM predictions cannot be evaluated; and benchmarking for languages other than English has been largely neglected.
  • Key idea
    MedExpQA is the first multilingual benchmark based on medical exams for evaluating Large Language Models (LLMs) in Medical Question Answering. It includes reference gold explanations written by medical doctors, which can be used to establish gold-based upper bounds for comparison against LLM performance. The paper shows that LLMs still have large room for improvement, especially for languages other than English.
  • Other highlights
    The paper presents MedExpQA, a multilingual benchmark for Medical Question Answering with LLMs that provides gold reference explanations written by medical doctors. The benchmark is available in four languages. The experiments use Retrieval Augmented Generation (RAG) approaches, and the results demonstrate the difficulty of integrating readily available medical knowledge in a way that positively impacts downstream evaluation results. No open-source code is mentioned; a hypothetical retrieval sketch is given after this list.
  • Related work
    Recent related work includes benchmarking LLMs for Medical Question Answering, but MedExpQA is the first multilingual benchmark with reference gold explanations. Other related work includes studies on the challenges of integrating medical knowledge into LLMs and improving their performance for Medical QA.
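
The paper releases no code, but the evaluation setup it describes is straightforward to picture. Below is a minimal, hypothetical Python sketch of scoring an LLM on multiple-choice medical exam questions under the three grounding conditions the paper compares: no grounding, RAG-retrieved passages, and the doctor-written gold explanation as an upper bound. All names here (`ExamQuestion`, `query_llm`, `retrieve`, `corpus`) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch (not the authors' code): answer accuracy on
# multiple-choice medical exam questions under a grounding condition.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class ExamQuestion:
    question: str
    options: dict          # option letter -> option text, e.g. {"A": "...", ...}
    answer: str            # gold option letter
    gold_explanation: str  # doctor-written rationale (MedExpQA's key addition)

def build_prompt(q: ExamQuestion, grounding: Optional[str]) -> str:
    opts = "\n".join(f"{k}. {v}" for k, v in sorted(q.options.items()))
    ctx = f"Context:\n{grounding}\n\n" if grounding else ""
    return (f"{ctx}Question: {q.question}\n{opts}\n"
            "Answer with the letter of the correct option.")

def accuracy(questions, query_llm: Callable[[str], str],
             grounding_fn=None) -> float:
    """Fraction of questions answered correctly under one grounding condition."""
    correct = 0
    for q in questions:
        grounding = grounding_fn(q) if grounding_fn else None
        reply = query_llm(build_prompt(q, grounding))
        # Naive parsing: take the first option letter that appears in the reply.
        pred = next((c for c in reply.upper() if c in q.options), None)
        correct += int(pred == q.answer)
    return correct / len(questions)

# The three conditions compared in the paper (query_llm, retrieve and corpus
# are stand-ins):
#   accuracy(qs, query_llm)                                          # no grounding
#   accuracy(qs, query_llm, lambda q: retrieve(q.question, corpus))  # RAG
#   accuracy(qs, query_llm, lambda q: q.gold_explanation)            # gold bound
```

The gap between the RAG condition and the gold-explanation condition is exactly the knowledge-integration difficulty the paper highlights: the gold rationale entails the answer, while retrieved passages often do not.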
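The `retrieve` stand-in above can be sketched as plain dense retrieval over a medical corpus. This too is an assumption for illustration: the encoder and the passage list are placeholders, and the paper's actual RAG pipeline may differ.

```python
# Hypothetical dense-retrieval sketch for the RAG condition; the encoder
# and the passage list are placeholders, not the paper's pipeline.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model

def retrieve(question: str, passages: list, k: int = 3) -> str:
    """Return the k passages most similar to the question, joined as context."""
    embs = encoder.encode([question] + passages, normalize_embeddings=True)
    scores = embs[1:] @ embs[0]      # cosine similarity on unit-norm vectors
    top = np.argsort(scores)[::-1][:k]
    return "\n\n".join(passages[i] for i in top)
```

In practice the passage embeddings would be computed once offline and indexed; re-encoding the corpus on every query, as above, is only for brevity.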