MetaAligner: Conditional Weak-to-Strong Correction for Generalizable Multi-Objective Alignment of Language Models

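March 25, 2024
  • Summary
    Recent advances in large language models (LLMs) aim to address heterogeneous human expectations and values through multi-objective preference alignment. However, existing methods are parameter-adherent to the policy model, which leads to two key limitations: (1) the high cost of repeating the alignment algorithm for every new target model; and (2) the inability to extend to unseen objectives, because their alignment objectives are static. In this paper, we propose the Meta-Objective Aligner (MetaAligner), a model that performs conditional weak-to-strong correction, rewriting weak responses so that they approach strong responses. MetaAligner is the first policy-agnostic and generalizable method for multi-objective preference alignment. It enables plug-and-play alignment by decoupling parameter updates from the policy models, and it supports zero-shot preference alignment for unseen objectives via in-context learning. Experimental results show that MetaAligner achieves significant and balanced improvements in multi-objective alignment on 11 policy models with up to 63x more parameters. It also outperforms previous alignment methods while using up to 22.27x fewer computational resources. The model aligns accurately with unseen objectives as well, marking a first step toward generalizable multi-objective preference alignment.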
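  • Problem addressed
    Existing multi-objective preference alignment methods are parameter-adherent to the policy model. As a result, the alignment must be rerun for every new policy model, and the fixed alignment objectives cannot be extended to unseen ones. MetaAligner, a policy-agnostic meta-objective aligner for multi-objective preference alignment, targets both limitations.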
  • Key idea
    MetaAligner is a policy-agnostic, generalizable method for multi-objective preference alignment. It performs conditional weak-to-strong correction, rewriting weak responses so that they approach strong ones. Its parameter updates are decoupled from the policy models, which enables plug-and-play alignment. It also supports zero-shot alignment to unseen objectives via in-context learning.
  • Other highlights
    MetaAligner achieves significant and balanced improvements in multi-objective alignment on 11 policy models with up to 63x more parameters. It outperforms previous alignment methods while using up to 22.27x fewer computational resources. The model also aligns accurately with unseen objectives, marking a first step toward generalizable multi-objective preference alignment.
  • Related work
    Related work includes recent multi-objective preference alignment methods for large language models (LLMs). These methods are parameter-adherent to the policy model and rely on static alignment objectives. Examples of related papers include 'Multi-Objective Training of Transformer-Based Dialogue Models with Importance Sampling' and 'Multi-Objective Dialogue Policy Learning via Pareto-Efficient Ranking'.
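The following is a minimal Python sketch of the plug-and-play, objective-conditioned correction described above. It is an illustration under stated assumptions, not MetaAligner's released code or prompt format. `Objective`, `build_correction_prompt`, `make_aligned`, the prompt template, and the example objectives are all hypothetical. The stub functions stand in for real policy and aligner LLMs.

The sketch shows two properties the summary emphasizes:
- The policy is a black box that is only called, never updated (policy-agnostic).
- Objectives are passed as text in the corrector's input, so an unseen objective is requested the same way as a trained one.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

TextModel = Callable[[str], str]  # any black-box text-in/text-out model


@dataclass(frozen=True)
class Objective:
    """An alignment objective written in natural language (hypothetical schema)."""
    name: str
    description: str


def build_correction_prompt(query: str, draft: str, objectives: Sequence[Objective]) -> str:
    """Condition the corrector on the requested objectives (hypothetical template).

    Objectives are plain text, so an objective not seen in training is requested
    the same way as a trained one: by including its description in the input.
    """
    if not objectives:
        raise ValueError("at least one objective is required")
    aspects = "\n".join(f"- {o.name}: {o.description}" for o in objectives)
    return (
        "Rewrite the answer so that it better satisfies these objectives:\n"
        f"{aspects}\n"
        f"Question: {query}\n"
        f"Answer: {draft}\n"
        "Improved answer:"
    )


def make_aligned(policy: TextModel, corrector: TextModel,
                 objectives: Sequence[Objective]) -> TextModel:
    """Wrap any policy with a separately trained corrector (plug-and-play).

    The policy is only called, never updated, so one corrector can be reused
    across policy models of different sizes and architectures.
    """
    def aligned(query: str) -> str:
        draft = policy(query)  # stage 1: the unmodified policy writes a (weak) draft
        prompt = build_correction_prompt(query, draft, objectives)
        return corrector(prompt)  # stage 2: conditional correction of the draft
    return aligned


if __name__ == "__main__":
    helpful = Objective("helpfulness", "directly address the user's request")
    brevity = Objective("brevity", "answer in one sentence")  # assumed unseen in training

    # Stubs stand in for real LLMs so the sketch runs without model weights.
    def small_policy(q: str) -> str:
        return "short draft"

    def large_policy(q: str) -> str:
        return "long draft"

    def echo_corrector(prompt: str) -> str:
        # Extracts the draft from the prompt and tags it, instead of really rewriting it.
        draft = prompt.split("Answer: ")[1].split("\n")[0]
        return f"[corrected] {draft}"

    for policy in (small_policy, large_policy):
        aligned = make_aligned(policy, echo_corrector, [helpful, brevity])
        print(aligned("How do I boil an egg?"))  # "[corrected] short draft", then "[corrected] long draft"
```

Swapping `small_policy` for `large_policy` requires no change to the corrector, which is the sense in which the alignment is plug-and-play. Whether a real corrector handles an unseen objective well depends on its training, which these stubs do not model.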