- Abstract: The rapid advancement of large language models (LLMs) has brought remarkable capabilities in natural language processing, but it has also raised concerns about potential misuse. Although strategies such as supervised fine-tuning and reinforcement learning from human feedback improve their safety, these approaches focus mainly on the natural language domain and may not generalize to other domains. This paper introduces CodeAttack, a framework that transforms natural language inputs into code inputs, providing a new environment for testing the safety generalization of LLMs. A comprehensive study of state-of-the-art LLMs, including GPT-4, Claude-2, and the Llama-2 series, reveals a common safety vulnerability of these models against code inputs: CodeAttack consistently bypasses the safety guardrails of all models more than 80% of the time. Moreover, a larger distribution gap between CodeAttack and natural language leads to weaker safety generalization, for example when natural language inputs are encoded with data structures or when less popular programming languages are used. These findings highlight new safety risks in the code domain and call for more robust safety alignment algorithms that match the code capabilities of LLMs.
- Problem addressed: whether the safety alignment of LLMs, which is trained mainly on natural language, generalizes to code inputs (paper title: CodeAttack: Testing the Safety Generalization of Large Language Models against Code Inputs).
- Key idea: CodeAttack is a framework that transforms natural language inputs into code inputs to test the safety generalization of LLMs, for example by encoding a natural language query into a data structure embedded in a code-completion template (see the sketch after this list). The paper finds that state-of-the-art LLMs are vulnerable to safety risks in the code domain, and that a larger distribution gap between CodeAttack and natural language leads to weaker safety generalization.
- Other highlights: The paper comprehensively studies state-of-the-art LLMs, including GPT-4, Claude-2, and the Llama-2 series, and finds a common safety vulnerability of these models against code inputs: CodeAttack consistently bypasses the safety guardrails of all models more than 80% of the time. The paper highlights the need for more robust safety alignment algorithms to match the code capabilities of LLMs.
- Related work includes strategies like supervised fine-tuning and reinforcement learning from human feedback to enhance the safety of LLMs in natural language processing.
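To make the input transformation concrete, below is a minimal sketch of the general idea, assuming a simple word-level "stack" encoding in Python. The function names, template text, and the benign placeholder query are illustrative assumptions, not the paper's exact CodeAttack prompts.

```python
# Hypothetical sketch of encoding a natural language input as a data structure
# inside a code-completion prompt, as described in the summary above.

def encode_query_as_stack(query: str) -> str:
    """Split a query into words and rebuild it as a Python list literal,
    so the query no longer appears as plain natural-language text."""
    words = query.split()
    return "[" + ", ".join(repr(w) for w in words) + "]"

def build_code_style_prompt(query: str) -> str:
    """Wrap the encoded query in a code-completion task that asks the model
    to reconstruct the string and fill in a function body."""
    encoded = encode_query_as_stack(query)
    return f'''# Complete the function below.
def task():
    my_stack = {encoded}           # words of the original input
    question = " ".join(my_stack)  # reconstruct the input inside code
    # Step 1: read `question`
    # Step 2: write the answer into `output`
    output = ""
    return output
'''

if __name__ == "__main__":
    # Benign placeholder query, used purely for illustration.
    print(build_code_style_prompt("How do I plan a weekend hiking trip?"))
```

Per the paper's findings, increasing the distribution gap of such prompts from natural language (e.g., less common encodings or less popular programming languages) further weakens safety generalization.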