Understanding the Performance Gap in Preference Learning: A Dichotomy of RLHF and DPO