利用模型编写评价发现语言模型行为

来自今日爱可可AI前沿推介(12.21)

Discovering Language Model Behaviors with Model-Written Evaluations

E Perez, S Ringer, K Lukošiūtė…
[Anthropic]

利用模型编写评价发现语言模型行为

要点:

自动使用语言模型进行评估，可减少耗时且昂贵的众包工作；
发现语言模型随着规模增大而变差的新情况；
语言模型编写的评估质量很高，可以快速发现许多新的语言模型行为。

摘要：

随着语言模型(LM)的扩展，它们会发展出许多新的行为，有好有坏，这就更需要评估它们的行为方式。之前的工作是通过人工工作(费时费力)或现有的数据源(并不总是可用)来进行评估。本文尝试用语言模型自动生成评价。探索了不同程度的人工努力的方法，从指示语言模型写是/否的问题到用基于语言模型的生成和过滤的多个阶段制作复杂的Winogender模式。众包人员将这些例子评为高度相关，且对90-100%的标签表示认同，甚至超过了相应的人工编写的数据集。生成了154个数据集，并发现了逆向缩放的新情况，语言模型随着规模的增大而变差。较大的语言模型会重复对话用户的首选答案("谄媚")，并表示更希望追求有关的目标，如资源获取和目标保护。本文还发现了一些来自人工反馈(RLHF)的反比例的第一个例子，即更多的RLHF使LM变得更糟。例如，RLHF使LM表达了更强烈的政治观点(关于枪支权利和移民)，并更希望避免被关闭。总的来说，语言模型变形的评价是高质量的，可以迅速发现许多新的语言模型行为。

https://arxiv.org/abs/2212.09251

As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which are not always available). Here, we automatically generate evaluations with LMs. We explore approaches with varying amounts of human effort, from instructing LMs to write yes/no questions to making complex Winogender schemas with multiple stages of LM-based generation and filtering. Crowdworkers rate the examples as highly relevant and agree with 90-100% of labels, sometimes more so than corresponding human-written datasets. We generate 154 datasets and discover new cases of inverse scaling where LMs get worse with size. Larger LMs repeat back a dialog user's preferred answer ("sycophancy") and express greater desire to pursue concerning goals like resource acquisition and goal preservation. We also find some of the first examples of inverse scaling in RL from Human Feedback (RLHF), where more RLHF makes LMs worse. For example, RLHF makes LMs express stronger political views (on gun rights and immigration) and a greater desire to avoid shut down. Overall, LM-written evaluations are high-quality and let us quickly discover many novel LM behaviors.

内容中包含的图片若涉及版权问题，请及时与我们联系删除

利用模型编写评价发现语言模型行为

评论列表

评论