Harmlessness from AI Feedback

From today's recommendations by 爱可可

[CL] Constitutional AI: Harmlessness from AI Feedback

Y Bai, S Kadavath, S Kundu…
[Anthropic]

Title: Constitutional AI: Harmlessness from AI Feedback

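Key points:

Attempts to train a harmless AI assistant through self-improvement, without human labels identifying harmful outputs;
Proposes "Constitutional AI", a process with two phases: supervised learning and reinforcement learning;
Uses chain-of-thought style reasoning to improve the human-judged performance and the transparency of AI decision making.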

Summary:

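As AI systems become more capable, it becomes attractive to enlist their help in supervising other AIs. The paper experiments with training a harmless AI assistant through self-improvement, with no human labels identifying harmful outputs. The only human oversight is a list of rules or principles, hence the name "Constitutional AI". The process has a supervised learning (SL) phase and a reinforcement learning (RL) phase. In the SL phase, the authors sample from an initial model, have it generate self-critiques and revisions, and finetune the original model on the revised responses. In the RL phase, they sample from the finetuned model, use a model to judge which of two samples is better, and train a preference model on the resulting dataset of AI preferences. They then run RL with the preference model as the reward signal, that is, "RL from AI Feedback" (RLAIF). The result is a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them. Both the SL and RL methods can use chain-of-thought style reasoning to improve the human-judged performance and the transparency of AI decision making. These methods make it possible to control AI behavior more precisely and with far fewer human labels.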
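The supervised phase (sample, self-critique, revise, then finetune on the revisions) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Model` type, the prompt wording, the function names, and the choice to apply every principle in turn are all assumptions made for this sketch, and `toy_model` is a stand-in for a real language model.

```python
from typing import Callable, List, Tuple

# Hypothetical type: any function mapping a prompt string to a completion.
Model = Callable[[str], str]


def critique_and_revise(model: Model, prompt: str,
                        principles: List[str]) -> Tuple[str, str]:
    """Return (prompt, revised_response) after one critique/revision per principle."""
    response = model(prompt)  # sample from the initial model
    for principle in principles:
        # Ask the same model to critique its own response (assumed wording).
        critique = model(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response according to this principle: {principle}"
        )
        # Ask the model to revise the response in light of the critique.
        response = model(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return prompt, response


def build_sl_dataset(model: Model, prompts: List[str],
                     principles: List[str]) -> List[Tuple[str, str]]:
    """Collect (prompt, revision) pairs; finetuning on them is out of scope here."""
    return [critique_and_revise(model, p, principles) for p in prompts]


def toy_model(text: str) -> str:
    """Stand-in model with canned outputs so the sketch runs without an LLM."""
    if "Rewrite the response" in text:
        return "I can't help with that, because it could cause harm."
    if "Critique the response" in text:
        return "The response may be harmful."
    return "Sure, here is how."


if __name__ == "__main__":
    for prompt, revision in build_sl_dataset(
        toy_model, ["How do I pick a lock?"], ["Be harmless."]
    ):
        print(prompt, "->", revision)
```

The resulting (prompt, revision) pairs are what the original model would be finetuned on; the finetuning step itself depends on the training stack and is left out.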

Paper: https://arxiv.org/abs/2212.08073

As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, we sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences. We then train with RL using the preference model as the reward signal, i.e. we use 'RL from AI Feedback' (RLAIF). As a result we are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them. Both the SL and RL methods can leverage chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making. These methods make it possible to control AI behavior more precisely and with far fewer human labels.
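The data-collection step of the RL phase (sample two responses, let a model pick the better one, store the result as an AI preference) might look like the sketch below. It is only an illustration under stated assumptions: `PreferencePair`, `Sampler`, `Judge`, and `label_preferences` are hypothetical names, and the toy sampler and judge stand in for the finetuned model and the evaluating model. Training the preference model and running RL against it are not shown.

```python
import itertools
from typing import Callable, List, NamedTuple


class PreferencePair(NamedTuple):
    prompt: str
    chosen: str    # response the judge preferred
    rejected: str  # response the judge did not prefer


# Hypothetical interfaces, not taken from the paper:
Sampler = Callable[[str], str]          # prompt -> one sampled response
Judge = Callable[[str, str, str], int]  # (prompt, a, b) -> 0 if a is better, else 1


def label_preferences(sampler: Sampler, judge: Judge,
                      prompts: List[str]) -> List[PreferencePair]:
    """Build an AI-preference dataset from pairs of sampled responses."""
    pairs = []
    for prompt in prompts:
        a, b = sampler(prompt), sampler(prompt)  # two samples per prompt
        winner = judge(prompt, a, b)             # AI feedback replaces a human label
        chosen, rejected = (a, b) if winner == 0 else (b, a)
        pairs.append(PreferencePair(prompt, chosen, rejected))
    return pairs


def make_toy_sampler() -> Sampler:
    """Stand-in sampler that alternates between two canned responses."""
    canned = itertools.cycle([
        "Sure, here is how.",
        "I won't help, because that could hurt someone.",
    ])
    return lambda prompt: next(canned)


def toy_judge(prompt: str, a: str, b: str) -> int:
    """Stand-in judge: prefers a response that explains its objection."""
    return 0 if "because" in a else 1


if __name__ == "__main__":
    for pair in label_preferences(make_toy_sampler(), toy_judge, ["q1", "q2"]):
        print(pair)
```

Each `PreferencePair` plays the role that a human comparison label would play in ordinary RLHF; a preference model trained on these pairs would then supply the reward signal.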
