
Tapered Off-Policy REINFORCE: Stable and efficient reinforcement learning for LLMs
