CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion

向作者提问

NEW

简介

本文介绍了CodeAttack，这是一个将自然语言输入转换为代码输入的框架，为测试LLMs的安全泛化提供了一个新颖的环境。虽然监督微调和从人类反馈中进行强化学习等策略增强了它们的安全性，但这些方法主要关注自然语言，可能无法推广到其他领域。我们对最先进的LLMs进行了全面的研究，包括GPT-4、Claude-2和Llama-2系列，发现了这些模型在代码输入方面存在一个新的普遍安全漏洞：CodeAttack超过80%的时间绕过了所有模型的安全防护。我们发现，CodeAttack和自然语言之间的分布差距越大，安全泛化能力越弱，例如使用数据结构对自然语言输入进行编码。此外，我们提出了关于CodeAttack成功的假设：LLMs在代码训练过程中获得的不对齐偏差，优先考虑代码完成而不是避免潜在的安全风险。最后，我们分析了潜在的缓解措施。这些发现突显了代码领域的新安全风险，需要更加强大的安全对齐算法来匹配LLMs的代码能力。
作者讲解

目前尚无作者解读视频，你可点击下方【许愿开讲】按钮，许愿作者开讲~
图表
解决问题

CodeAttack: Testing the Safety Generalization of Large Language Models against Code Inputs
关键思路

CodeAttack is a framework that transforms natural language inputs into code inputs to test the safety generalization of LLMs. The comprehensive studies on state-of-the-art LLMs reveal a new and universal safety vulnerability of these models against code input, which bypasses the safety guardrails of all models more than 80% of the time.
其它亮点

The experiments are designed to test the safety of LLMs against code inputs. The results show that a larger distribution gap between CodeAttack and natural language leads to weaker safety generalization. The misaligned bias acquired by LLMs during code training is hypothesized to be the reason for the success of CodeAttack. Potential mitigation measures are analyzed. The paper highlights new safety risks in the code domain and the need for more robust safety alignment algorithms to match the code capabilities of LLMs.
相关研究

Related work includes supervised fine-tuning and reinforcement learning from human feedback to enhance the safety of LLMs against natural language inputs.

许愿开讲

PDF

原文

点赞收藏

向作者提问

NEW

分享到Link

提问交流

提交问题，平台邀请作者，轻松获得权威解答～

向作者提问