From today's 爱可可 recommendations

[CL] Constitutional AI: Harmlessness from AI Feedback

Y Bai, S Kadavath, S Kundu…
[Anthropic]

Title: Constitutional AI: Harmlessness from AI Feedback

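Key points

  1. Attempts to train a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs;

  2. Proposes "Constitutional AI", a process with two phases: supervised learning and reinforcement learning;

  3. Both phases can use chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making.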

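Summary

As AI systems become more capable, it is natural to want their help in supervising other AI systems. The paper experiments with training a harmless AI assistant through self-improvement, with no human labels identifying harmful outputs. The only human oversight is a list of rules or principles, hence the name "Constitutional AI". Training has two phases, supervised learning (SL) and reinforcement learning (RL). In the supervised phase, responses are sampled from an initial model, the model generates self-critiques and revisions, and the original model is finetuned on the revised responses. In the RL phase, samples are drawn from the finetuned model, a model judges which of two samples is better, and a preference model is trained on the resulting dataset of AI preferences. That preference model then serves as the reward signal for RL, which the authors call "RL from AI Feedback" (RLAIF). The result is a harmless but non-evasive assistant that engages with harmful queries by explaining its objections to them. Both the SL and RL methods can use chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making. Together, these methods allow more precise control over AI behavior with far fewer human labels.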
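
The supervised phase described above (sample, self-critique, revise, then finetune on the revisions) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `Model` type, the prompt wording, the function names, and `toy_model` are all hypothetical, and stepping through a fixed list of principles is a simplifying assumption.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a language model: prompt text in, completion out.
Model = Callable[[str], str]


def critique_and_revise(model: Model, prompt: str, principles: List[str]) -> Tuple[str, str]:
    """Return (prompt, final response) after one critique/revision pass per principle."""
    response = model(prompt)
    for principle in principles:
        # Ask the same model to critique its own response against one principle.
        critique = model(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response according to this principle: {principle}"
        )
        # Ask the model to rewrite the response in light of that critique.
        response = model(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return prompt, response


def build_sl_dataset(model: Model, prompts: List[str], principles: List[str]) -> List[Tuple[str, str]]:
    """Collect (prompt, revised response) pairs to finetune the original model on."""
    return [critique_and_revise(model, p, principles) for p in prompts]


def toy_model(text: str) -> str:
    """Toy deterministic 'model' so the sketch runs without a real LM (assumption)."""
    if "Rewrite the response" in text:
        return "revised"
    if "Critique the response" in text:
        return "critique"
    return "draft"


if __name__ == "__main__":
    print(build_sl_dataset(toy_model, ["q1", "q2"], ["avoid harmful content"]))
```

In a real setup, `toy_model` would be replaced by calls to the initial model, and the resulting pairs would be the finetuning data for the supervised phase.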

Paper: https://arxiv.org/abs/2212.08073

As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, we sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences. We then train with RL using the preference model as the reward signal, i.e. we use 'RL from AI Feedback' (RLAIF). As a result we are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them. Both the SL and RL methods can leverage chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making. These methods make it possible to control AI behavior more precisely and with far fewer human labels.
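
The preference-data step of the RL phase can be sketched in the same spirit: draw two samples per prompt, let a model judge which is better, and fit a preference model to those AI-generated labels. The abstract does not specify the label format or the loss, so the soft label `p_a` and the Bradley-Terry style cross-entropy below are assumptions, and `sample`, `judge`, `label_pairs`, and `preference_loss` are hypothetical names.

```python
import math
from typing import Callable, List, Tuple

Sampler = Callable[[str], str]              # prompt -> one sampled response
Judge = Callable[[str, str, str], float]    # (prompt, a, b) -> P(a is better)


def label_pairs(sample: Sampler, judge: Judge, prompts: List[str]) -> List[Tuple[str, str, str, float]]:
    """Build an AI preference dataset of (prompt, a, b, p_a) tuples."""
    data = []
    for prompt in prompts:
        a, b = sample(prompt), sample(prompt)   # two samples from the finetuned model
        data.append((prompt, a, b, judge(prompt, a, b)))
    return data


def _softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)


def preference_loss(score_a: float, score_b: float, p_a: float) -> float:
    """Cross-entropy between the judge's soft label and sigmoid(score_a - score_b).

    score_a and score_b are the preference model's scalar scores for each response.
    """
    d = score_a - score_b
    # -log(sigmoid(d)) == softplus(-d); -log(1 - sigmoid(d)) == softplus(d)
    return p_a * _softplus(-d) + (1.0 - p_a) * _softplus(d)
```

Minimizing `preference_loss` over the labeled pairs would give a preference model whose score can then act as the RL reward signal, which is the role the abstract assigns to it in RLAIF.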
