This article comes from today's 爱可可AI前沿推介 (AI frontier paper picks).

[CL] Attributed Question Answering: Evaluation and Modeling for Attributed Large Language Models

B. Bohnet, V. Q. Tran, P. Verga, R. Aharoni, D. Andor, L. B. Soares...
[Google Research]

Key points:

  1. Proposes an evaluation framework for attributed QA that uses human annotation as the gold standard together with an automatic metric;
  2. Benchmarks systems built from state-of-the-art components, and discusses both post-hoc and end-to-end approaches to attribution with language models (see the sketch after this list).
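
For intuition about the post-hoc route mentioned in point 2, here is a minimal sketch, not the paper's actual pipeline: the model first answers on its own, and a retriever then looks for a passage that supports the (question, answer) pair. The toy corpus, the `generate_answer` placeholder, and the choice of BM25 via the rank_bm25 package are assumptions for illustration only.

```python
# Post-hoc attribution sketch (illustrative only, not the paper's system):
# step 1: an LLM answers the question with no evidence attached;
# step 2: a lexical retriever (BM25) searches a corpus for the passage
#         that best supports the (question, answer) pair.
from rank_bm25 import BM25Okapi  # pip install rank-bm25

CORPUS = [
    "Bern has been the de facto capital of Switzerland since 1848.",
    "Zurich is the largest city in Switzerland by population.",
    "Geneva hosts many international organizations.",
]

def generate_answer(question: str) -> str:
    """Hypothetical stand-in for an LLM call that returns a short answer."""
    return "Bern"

def attribute_post_hoc(question: str, answer: str, passages: list[str]) -> str:
    """Return the passage that scores highest against question + answer."""
    bm25 = BM25Okapi([p.lower().split() for p in passages])
    scores = bm25.get_scores((question + " " + answer).lower().split())
    best = max(range(len(passages)), key=lambda i: scores[i])
    return passages[best]

if __name__ == "__main__":
    question = "What is the capital of Switzerland?"
    answer = generate_answer(question)
    print(answer, "<-", attribute_post_hoc(question, answer, CORPUS))
```

An end-to-end variant would instead produce the answer and its supporting passage jointly, for example by retrieving first and conditioning generation on the retrieved evidence.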

Abstract:

Large language models (LLMs) have shown impressive results across a variety of tasks while requiring little or no direct supervision. Further, there is mounting evidence that LLMs may have potential in information-seeking scenarios. We believe the ability of an LLM to attribute the text that it generates is likely to be crucial for both system developers and users in this setting. We propose and study Attributed QA as a key first step in the development of attributed LLMs. We develop a reproducible evaluation framework for the task, using human annotations as a gold standard and a correlated automatic metric that we show is suitable for development settings. We describe and benchmark a broad set of architectures for the task. Our contributions give some concrete answers to two key questions (How to measure attribution?, and How well do current state-of-the-art methods perform on attribution?), and give some hints as to how to address a third key question (How to build LLMs with attribution?).
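
The automatic metric in the evaluation framework approximates the human attribution judgment; one natural way to implement such a check is with a natural language inference (NLI) model that asks whether the cited passage entails the answer in the context of the question. The sketch below uses an off-the-shelf MNLI model as a stand-in; the model name, hypothesis template, and label handling are assumptions for illustration, not the paper's exact configuration.

```python
# Entailment-style attribution check (illustrative stand-in for an
# NLI-based automatic metric): does the cited passage entail the answer?
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "microsoft/deberta-large-mnli"  # off-the-shelf NLI model (assumption)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def attribution_score(question: str, answer: str, passage: str) -> float:
    """Probability that `passage` entails a declarative form of the QA pair."""
    # The hypothesis template is a simplification; real systems often rewrite
    # the question/answer into a proper declarative sentence first.
    hypothesis = f"The answer to the question '{question}' is {answer}."
    inputs = tokenizer(passage, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(dim=-1)[0]
    entail_idx = model.config.label2id.get("ENTAILMENT", probs.numel() - 1)
    return probs[entail_idx].item()

score = attribution_score(
    "What is the capital of Switzerland?",
    "Bern",
    "Bern has been the de facto capital of Switzerland since 1848.",
)
print(f"attribution score: {score:.3f}")
```

In the paper this role is played by an automatic metric validated against human ratings; the sketch only shows the shape of such a check, a premise (the cited passage) and a hypothesis (the answer in context) scored for entailment.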

Paper: https://arxiv.org/abs/2212.08037

