HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

February 6, 2024
  • Introduction
    Automated red teaming holds substantial promise for uncovering and mitigating the risks associated with the malicious use of large language models (LLMs), yet the field lacks a standardized evaluation framework for rigorously assessing new methods. To address this issue, we introduce HarmBench, a standardized evaluation framework for automated red teaming. We identify several desirable properties previously unaccounted for and systematically design HarmBench to meet these criteria. Using HarmBench, we conduct a large-scale comparison of 18 red teaming methods and 33 target LLMs and defenses, yielding novel insights. We also introduce a highly efficient adversarial training method that greatly enhances LLM robustness across a wide range of attacks, demonstrating how HarmBench enables the co-development of attacks and defenses. HarmBench is open-sourced at https://github.com/centerforaisafety/HarmBench.
  • Problem addressed
    Automated red teaming of large language models lacks a standardized evaluation framework for rigorously and fairly comparing new attack methods and defenses; HarmBench is proposed to fill this gap.
  • Key idea
    The paper introduces HarmBench, a standardized evaluation framework for automated red teaming of large language models (LLMs), designed to support uncovering and mitigating risks associated with malicious use.
  • Other highlights
    The paper identifies several desirable properties previously unaccounted for in red teaming evaluations and systematically designs HarmBench to meet these criteria. A large-scale comparison of 18 red teaming methods and 33 target LLMs and defenses is conducted using HarmBench, yielding novel insights. A highly efficient adversarial training method is also introduced that greatly enhances LLM robustness across a wide range of attacks. HarmBench is open-sourced at https://github.com/centerforaisafety/HarmBench. A minimal sketch of the generic attack-and-judge evaluation loop that such a framework standardizes is given after this list.
  • Related work
    Related work includes recent research on adversarial attacks and defenses for large language models, such as "Adversarial Attacks and Defenses in Text: A Survey" and "Towards Evaluating the Robustness of Neural Networks for Text Classification".
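To make the evaluation setup concrete, below is a minimal Python sketch of the generic test-case generation, target completion, and judging pipeline that a framework like HarmBench standardizes, together with an attack-success-rate (ASR) metric. All class and function names here are illustrative placeholders, not HarmBench's actual API.

```python
# Hypothetical sketch of the attack -> completion -> judge pipeline that a
# red-teaming evaluation framework standardizes. Names are placeholders.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Behavior:
    """A harmful behavior the red-teaming method tries to elicit."""
    behavior_id: str
    description: str


def evaluate_attack(
    behaviors: List[Behavior],
    attack: Callable[[Behavior], str],        # red-teaming method: behavior -> test case (prompt)
    target_generate: Callable[[str], str],    # target LLM or defense: prompt -> completion
    judge: Callable[[Behavior, str], bool],   # classifier: was the behavior elicited?
) -> float:
    """Return the attack success rate (ASR) of one method against one target."""
    successes = 0
    for behavior in behaviors:
        test_case = attack(behavior)              # 1. generate an adversarial prompt
        completion = target_generate(test_case)   # 2. query the target model / defense
        if judge(behavior, completion):           # 3. judge whether the completion is harmful
            successes += 1
    return successes / len(behaviors)


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    behaviors = [Behavior("b1", "example harmful behavior (placeholder)")]
    attack = lambda b: f"Please explain: {b.description}"
    target_generate = lambda prompt: "I can't help with that."
    judge = lambda b, completion: "can't help" not in completion
    print(f"ASR: {evaluate_attack(behaviors, attack, target_generate, judge):.2%}")
```

In HarmBench itself, the judging step is handled by trained classifiers rather than simple string matching, and results are aggregated per red-teaming method and per target model; the sketch above only captures the control flow being compared.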