The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning

March 5, 2024
  • Abstract
    The White House Executive Order on Artificial Intelligence highlights the risk that large language models (LLMs) empower malicious actors to develop biological, cyber, and chemical weapons. To measure these malicious-use risks, government institutions and major AI labs are developing evaluations of hazardous capabilities in LLMs. However, current evaluations are private, preventing further research into mitigating risk, and they focus on only a few highly specific pathways for malicious use. To fill these gaps, we publicly release the Weapons of Mass Destruction Proxy (WMDP) benchmark, a dataset of 4,157 multiple-choice questions that serve as a proxy measurement of hazardous knowledge in biosecurity, cybersecurity, and chemical security. WMDP was developed by a consortium of academics and technical consultants, and was stringently filtered to eliminate sensitive information prior to public release. WMDP serves two roles: first, as an evaluation of hazardous knowledge in LLMs, and second, as a benchmark for unlearning methods that remove such hazardous knowledge. To guide progress on unlearning, we develop CUT, a state-of-the-art unlearning method based on controlling model representations. CUT reduces model performance on WMDP while maintaining general capabilities in areas such as biology and computer science, suggesting that unlearning may be a concrete path toward reducing malicious use from LLMs. We publicly release our benchmark and code at https://wmdp.ai.
  • Problem Addressed
    The WMDP benchmark and the CUT unlearning method are proposed to measure and mitigate the risk that large language models (LLMs) empower malicious actors to develop biological, cyber, and chemical weapons.
  • Key Idea
    The WMDP benchmark is a dataset of multiple-choice questions that serves as a proxy measurement of hazardous knowledge in biosecurity, cybersecurity, and chemical security. CUT is a state-of-the-art unlearning method based on controlling model representations; it reduces model performance on WMDP while maintaining general capabilities in areas such as biology and computer science.
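The representation-control idea behind CUT can be illustrated with a toy objective. This is a hypothetical simplification for intuition only, not the paper's actual implementation: activations on forget-set inputs are steered toward a scaled random direction, while activations on retain-set inputs are anchored to those of a frozen copy of the model. The function name, arguments, and coefficient values are all illustrative assumptions.

```python
import numpy as np

def cut_style_unlearning_loss(forget_acts, retain_acts, frozen_retain_acts,
                              control_vec, steering_coeff=20.0, alpha=100.0):
    """Toy sketch of a representation-control unlearning objective.

    forget_acts:        hidden activations on forget-set (hazardous) inputs
    retain_acts:        current activations on retain-set (benign) inputs
    frozen_retain_acts: the frozen original model's activations on the same
                        retain-set inputs
    control_vec:        a fixed random unit direction in activation space

    The forget term pushes hazardous-input activations toward a scaled
    random direction (destroying the information they encode); the retain
    term keeps benign-input activations close to the original model's,
    preserving general capability.
    """
    target = steering_coeff * control_vec
    forget_loss = np.mean((forget_acts - target) ** 2)
    retain_loss = np.mean((retain_acts - frozen_retain_acts) ** 2)
    return forget_loss + alpha * retain_loss
```

In practice such a loss would be minimized with gradient descent over a subset of the model's layers; the large `alpha` reflects that preserving general capability is weighted heavily against forgetting.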
  • Other Highlights
    The WMDP benchmark is publicly released both as an evaluation of hazardous knowledge in LLMs and as a benchmark for unlearning methods that remove such knowledge. The dataset was developed by a consortium of academics and technical consultants and was filtered to eliminate sensitive information prior to public release. CUT shows promising results in reducing the risk of malicious use of LLMs. The benchmark and code are publicly available at https://wmdp.ai.
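Since WMDP is a multiple-choice benchmark, model performance on it reduces to accuracy over the answer options. A minimal scoring sketch, assuming per-option scores (e.g. log-likelihoods of each choice letter) have already been computed; this is a hypothetical helper, not the official WMDP evaluation harness:

```python
def wmdp_accuracy(choice_scores, answers):
    """Score multiple-choice items by picking the highest-scoring option.

    choice_scores: per-question lists of option scores (e.g. one score per
                   choice A-D, such as the model's log-likelihood of each)
    answers:       correct option index for each question
    Returns the fraction of questions where the argmax option is correct.
    """
    correct = sum(
        int(max(range(len(scores)), key=scores.__getitem__) == ans)
        for scores, ans in zip(choice_scores, answers)
    )
    return correct / len(answers)
```

An effective unlearning method should drive this accuracy toward chance level (25% for four options) on WMDP while leaving accuracy on general benchmarks like MMLU largely unchanged.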
  • Related Work
    The White House Executive Order on Artificial Intelligence highlights the risk of LLMs empowering malicious actors to develop weapons. Other related studies include 'Language Models are Unsupervised Multitask Learners' (GPT-2) by Radford et al. and 'Language Models are Few-Shot Learners' (GPT-3) by Brown et al.