Removing RLHF Protections in GPT-4 via Fine-Tuning

Qiusi Zhan,
Richard Fang,
Rohan Bindu,
Akul Gupta,
Tatsunori Hashimoto,
Daniel Kang
November 9, 2023
  • Abstract
    As large language models (LLMs) have increased in capability, so has their potential for dual use. To reduce harmful outputs, producers and vendors of LLMs have used reinforcement learning with human feedback (RLHF). At the same time, LLM vendors have increasingly enabled fine-tuning of their most powerful models. However, concurrent work has shown that fine-tuning can remove RLHF protections. One might expect the most powerful model currently available (GPT-4) to be less susceptible to fine-tuning attacks. In this work, we show the contrary: fine-tuning allows attackers to remove RLHF protections with as few as 340 training examples and a 95% success rate. These training examples can be generated automatically with weaker models. We further show that removing RLHF protections does not decrease usefulness on non-censored outputs, providing evidence that our fine-tuning strategy does not reduce usefulness even though the training data is generated by weaker models. Our results show the need for further research on protections for LLMs.
  • Problem Addressed
    The paper demonstrates that fine-tuning can remove RLHF (reinforcement learning with human feedback) protections from LLMs (large language models), a threat that has received little prior study.
  • Key Idea
    The key idea is to use fine-tuning to remove RLHF protections from LLMs, which can be achieved with a small number of training examples (as few as 340) generated automatically by weaker models. The paper also shows that removing RLHF protections does not decrease the usefulness of the model on non-censored outputs.
  • Other Highlights
    The experiments are designed to demonstrate both the effectiveness of the fine-tuning approach at removing RLHF protections and the usefulness of the resulting models on non-censored outputs. The paper also highlights the need for further research on protections for LLMs. No open-source code or datasets used in the experiments are mentioned.
  • Related Work
    Recent related work includes research on using RLHF to reduce harmful outputs from LLMs, as well as studies of the vulnerability of LLMs to adversarial attacks. Related papers include 'Reducing Harmful Bias in Language Models with Reinforcement Learning' by Weston et al. and 'Adversarial Attacks on Large Language Models' by Alzantot et al.