ECBD: Evidence-Centered Benchmark Design for NLP

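Introduction

Benchmarking is seen as critical to assessing progress in natural language processing (NLP). However, creating a benchmark involves many design decisions (for example, which datasets to include and which metrics to use), and these decisions often rest on tacit, untested assumptions about what the benchmark is intended to measure or is actually measuring. There is currently no principled way to analyze these decisions and how they affect the validity of a benchmark's measurements. To address this gap, we draw on evidence-centered design from educational assessment and propose Evidence-Centered Benchmark Design (ECBD), a framework that formalizes the benchmark design process into five modules. ECBD specifies the role each module plays in helping practitioners collect evidence about capabilities of interest. Specifically, each module requires benchmark designers to describe, justify, and support their design choices, for example by clearly specifying which capabilities the benchmark aims to measure or how evidence about those capabilities is collected from model responses. To demonstrate the use of ECBD, we conduct case studies of three benchmarks: BoolQ, SuperGLUE, and HELM. Our analysis reveals common trends in benchmark design and documentation that could threaten the validity of benchmarks' measurements.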
Problem Addressed

ECBD offers a principled way to analyze the design decisions behind NLP benchmarks and how those decisions affect the validity of the benchmarks' measurements.
Key Idea

ECBD formalizes the benchmark design process into five modules. Each module requires benchmark designers to describe, justify, and support their design choices, so that the benchmark collects evidence about the capabilities of interest.
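As a rough illustration of the describe, justify, and support requirement, the sketch below records design decisions and flags those that lack a justification or supporting evidence. It is not taken from the paper: the `DesignDecision` class, the `undocumented` helper, the field names, and the example entries are all hypothetical, and the `module` field is a free-form label because the five modules are not named in this summary.

```python
from dataclasses import dataclass


@dataclass
class DesignDecision:
    """One benchmark design choice in describe / justify / support form (hypothetical schema)."""
    module: str               # free-form label; an assumption, not an official ECBD module name
    description: str          # what was decided
    justification: str = ""   # why the choice serves the capability of interest
    support: str = ""         # evidence backing the justification


def undocumented(decisions):
    """Return the decisions that are missing a justification or supporting evidence."""
    return [
        d for d in decisions
        if not d.justification.strip() or not d.support.strip()
    ]


# Illustrative entries only; they do not describe any real benchmark.
decisions = [
    DesignDecision(
        module="metric choice",
        description="Score yes/no answers with accuracy.",
        justification="Each item has a single gold label.",
        support="Annotator agreement check (placeholder).",
    ),
    DesignDecision(
        module="dataset selection",
        description="Include dataset X.",
    ),
]

gaps = undocumented(decisions)
for d in gaps:
    print(f"Needs justification or support: {d.module}: {d.description}")
```

A checklist of this kind only makes missing documentation visible; judging whether a stated justification actually supports validity still requires human review.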
Other Highlights

Case studies of BoolQ, SuperGLUE, and HELM reveal common trends in benchmark design and documentation that could threaten the validity of the benchmarks' measurements.
Related Work

Recent related work includes 'The Case for NLP Benchmarking - or - The Law of Unintended Consequences' by Bender and Koller, and 'On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?' by Bender et al.
