VERISCORE: Evaluating the factuality of verifiable claims in long-form text generation

Abstract

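Existing metrics for evaluating the factuality of long-form text, such as FACTSCORE (Min et al., 2023) and SAFE (Wei et al., 2024), decompose an input text into "atomic claims" and verify each against a knowledge base like Wikipedia. These metrics are not suitable for most generation tasks because they assume that every claim is verifiable (i.e., can plausibly be proven true or false). We address this issue with VERISCORE, a metric for diverse long-form generation tasks that contain both verifiable and unverifiable content. VERISCORE can be effectively implemented with either closed or fine-tuned open-weight language models, and human evaluation confirms that VERISCORE's extracted claims are more sensible than those from competing methods across eight different long-form tasks. We use VERISCORE to evaluate generations from 16 different models across multiple long-form tasks and find that while GPT-4o is the best-performing model overall, open-weight models such as Mixtral-8x22 are closing the gap. We show that a language model's VERISCORE on one task (e.g., biography generation) does not necessarily correlate with its VERISCORE on a different task (e.g., long-form QA), highlighting the need to expand factuality evaluation across tasks with varying fact density.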
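Problem Addressed

Existing factuality metrics assume that every claim is verifiable, which makes them unsuitable for long-form generation tasks that mix verifiable and unverifiable content. VERISCORE is a metric designed for such tasks.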
Key Idea

VERISCORE is proposed to evaluate the factuality of long-form generations that contain both verifiable and unverifiable content, a setting that existing metrics are not suited to. It extracts sensible, verifiable claims from diverse long-form tasks and can be used to compare models' factuality across tasks with varying fact density.
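
The extract-then-verify idea can be sketched roughly as follows. This is a simplified illustration, not VERISCORE's actual scoring formula, which this summary does not give. The `Claim` class, its `verifiable` and `supported` flags, and the `supported_fraction` function are all hypothetical names; in practice the flags would come from a claim-extraction model and a verifier rather than being set by hand.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Claim:
    text: str
    verifiable: bool  # hypothetical label: can the claim plausibly be proven true or false?
    supported: bool   # hypothetical verifier output; ignored when verifiable is False


def supported_fraction(claims: List[Claim]) -> Optional[float]:
    """Return the share of verifiable claims that are supported.

    Unverifiable claims (opinions, advice, and so on) are skipped instead of
    being counted as errors. Returns None when there is nothing to verify.
    """
    verifiable = [c for c in claims if c.verifiable]
    if not verifiable:
        return None
    return sum(c.supported for c in verifiable) / len(verifiable)


# Made-up example claims, for illustration only.
claims = [
    Claim("The Eiffel Tower is in Paris.", verifiable=True, supported=True),
    Claim("The Eiffel Tower is in Rome.", verifiable=True, supported=False),
    Claim("Everyone should visit Paris once.", verifiable=False, supported=False),
]
print(supported_fraction(claims))  # 0.5
```

Skipping unverifiable claims, rather than treating them as unsupported, is the point of the sketch: a metric that assumes every claim is verifiable would penalize the third claim even though it cannot be checked.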
Other Highlights

VERISCORE is used to evaluate generations from 16 different models across multiple long-form tasks. GPT-4o is the best-performing model overall, though open-weight models such as Mixtral-8x22 are closing the gap. In human evaluation across eight long-form tasks, the claims VERISCORE extracts are judged more sensible than those from competing methods. The metric can be implemented with either closed or fine-tuned open-weight language models. A model's VERISCORE on one task does not necessarily correlate with its score on another, which highlights the need to expand factuality evaluation across tasks with varying fact density.
Related Work

Existing metrics for evaluating the factuality of long-form text, such as FACTSCORE (Min et al., 2023) and SAFE (Wei et al., 2024), decompose an input text into atomic claims and verify each against a knowledge base like Wikipedia. These metrics are not suitable for most generation tasks because they assume that every claim is verifiable. The abstract mentions no other related work.
