
Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences
