WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild

June 7, 2024
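  • Abstract
    We introduce WildBench, an automated evaluation framework designed to benchmark large language models (LLMs) using challenging, real-world user queries. WildBench consists of 1,024 tasks carefully selected from over one million human-chatbot conversation logs. For automated evaluation with WildBench, we developed two metrics, WB-Reward and WB-Score, which can be computed using advanced LLMs such as GPT-4-turbo. WildBench evaluation uses task-specific checklists to assess model outputs systematically and provides structured explanations that justify the scores and comparisons, resulting in more reliable and interpretable automatic judgments. WB-Reward uses fine-grained pairwise comparisons between model responses, generating five potential outcomes: much better, slightly better, slightly worse, much worse, or a tie. Unlike previous evaluations that used a single baseline model, we selected three baseline models to ensure a comprehensive pairwise evaluation. Additionally, we propose a simple method to mitigate length bias: an outcome of "slightly better/worse" is converted to "tie" if the winner's response exceeds the loser's by more than $K$ characters. WB-Score evaluates the quality of model outputs individually, making it a fast and cost-efficient evaluation metric. WildBench results show a strong correlation with the human-voted Elo ratings from Chatbot Arena on hard tasks. Specifically, WB-Reward achieves a Pearson correlation of 0.98 with top-ranking models. Additionally, WB-Score reaches 0.95, surpassing both ArenaHard's 0.91 and AlpacaEval2.0's 0.89 for length-controlled win rates, as well as the 0.87 for regular win rates.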
  • Problem addressed
    The paper addresses how to benchmark large language models (LLMs) automatically on challenging queries from real-world users.
  • Key idea
    The paper presents WildBench, an automated evaluation framework designed to benchmark large language models (LLMs) using real-world user queries. The framework consists of 1,024 tasks selected from over one million human-chatbot conversation logs, and two metrics, WB-Reward and WB-Score, are developed for automated evaluation. The evaluation uses task-specific checklists to provide structured explanations that justify the scores and comparisons, resulting in more reliable and interpretable automatic judgments.
  • Other highlights
    WildBench achieves strong correlation with human-voted Elo ratings from Chatbot Arena on hard tasks. WB-Reward achieves a Pearson correlation of 0.98 with top-ranking models, and WB-Score reaches 0.95, surpassing both ArenaHard's 0.91 and AlpacaEval2.0's 0.89 for length-controlled win rates, as well as the 0.87 for regular win rates.
  • Related work
    Related work includes earlier evaluation frameworks for language models, such as GLUE and SuperGLUE, and other benchmark datasets such as LAMBADA and CoQA. The paper also discusses the limitations of current evaluation methods and the need for more comprehensive and realistic benchmarks for LLMs.
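
The length-bias rule that the abstract describes for WB-Reward can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. The abstract names the five outcomes but does not give their numeric rewards, so the values in OUTCOME_REWARD are assumptions, as are the function names, the tuple layout of a judgment, and the choice to average rewards over all pairwise judgments.

```python
from statistics import mean

# Assumed numeric rewards for the five outcomes (not stated in the summary).
OUTCOME_REWARD = {
    "much_better": 1.0,
    "slightly_better": 0.5,
    "tie": 0.0,
    "slightly_worse": -0.5,
    "much_worse": -1.0,
}


def length_penalized_outcome(outcome, model_len, baseline_len, k):
    """Convert a 'slightly' verdict to a tie when the winning response
    is more than k characters longer than the losing one."""
    if outcome == "slightly_better" and model_len - baseline_len > k:
        return "tie"  # the model won narrowly but with a much longer answer
    if outcome == "slightly_worse" and baseline_len - model_len > k:
        return "tie"  # the baseline won narrowly but with a much longer answer
    return outcome


def wb_reward(judgments, k):
    """Average reward over (outcome, model_len, baseline_len) judgments.

    Each judgment is one pairwise comparison of the evaluated model
    against one baseline on one task (hypothetical data layout).
    """
    return mean(
        OUTCOME_REWARD[length_penalized_outcome(outcome, m_len, b_len, k)]
        for outcome, m_len, b_len in judgments
    )


# Hypothetical judgments: (outcome, model length, baseline length) in characters.
judgments = [
    ("slightly_better", 1800, 900),  # narrow win, 900 characters longer
    ("much_better", 1200, 1000),     # clear wins are never converted
    ("slightly_worse", 700, 750),    # narrow loss, lengths are similar
]
reward_with_penalty = wb_reward(judgments, k=500)
reward_without_penalty = wb_reward(judgments, k=10**9)
```

With k=500, the first judgment becomes a tie because the winning response is 900 characters longer, so the average drops from 1/3 to 1/6. A larger k makes the penalty trigger less often. "Much better" and "much worse" outcomes are left untouched, which matches the abstract's statement that only the "slightly" outcomes are converted.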