Rho-1: Not All Tokens Are What You Need

April 11, 2024
  • Introduction
    Previous language model pre-training methods apply the next-token prediction loss to every training token. We argue, however, that "not all tokens in a corpus are equally important for language model training." Our preliminary analysis examines the token-level training dynamics of language models and reveals distinct loss patterns across different tokens. Building on these insights, we introduce a new language model called Rho-1. Unlike conventional language models, Rho-1 employs Selective Language Modeling (SLM), which selectively trains on useful tokens aligned with the desired distribution rather than learning to predict every next token in the corpus. This approach scores pretraining tokens with a reference model and then trains the language model with a focused loss on the tokens with higher excess loss. When continually pretrained on the 15B-token OpenWebMath corpus, Rho-1 yields an absolute improvement of up to 30% in few-shot accuracy across 9 math tasks. After fine-tuning, Rho-1-1B and 7B achieve state-of-the-art results of 40.6% and 51.8% on the MATH dataset, respectively, matching DeepSeekMath while using only 3% of its pretraining tokens. Furthermore, when pretrained on 80B general tokens, Rho-1 achieves an average gain of 6.8% across 15 diverse tasks, improving both the efficiency and performance of language model pretraining.
  • Problem Addressed
    Rho-1: Selective Language Modeling for Efficient Large-Scale Pretraining
  • Key Idea
    The paper proposes a selective language modeling approach called Rho-1, which trains only on the most useful tokens in a corpus, improving both the efficiency and performance of language model pre-training (see the code sketch after the related-work item below).
  • Other Highlights
    Rho-1 achieved an absolute improvement of up to 30% in few-shot accuracy across 9 math tasks, and Rho-1-1B and 7B reached state-of-the-art results of 40.6% and 51.8% on the MATH dataset using only 3% of the pretraining tokens. When pretrained on 80B general tokens, Rho-1 achieved an average gain of 6.8% across 15 diverse tasks. The paper also provides insights into the token-level training dynamics of language models and introduces a metric called excess loss. Experiments were conducted on the 15B-token OpenWebMath corpus and 80B general tokens, and the code is available on GitHub.
  • Related Work
    Related studies include 'BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding', 'Language Models are Unsupervised Multitask Learners' (GPT-2), and 'XLNet: Generalized Autoregressive Pretraining for Language Understanding'.
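
To make the selection step concrete, here is a minimal PyTorch sketch of Selective Language Modeling as described in the summary above: each token is scored by its excess loss (training-model loss minus reference-model loss), and the cross-entropy is averaged only over the highest-scoring fraction. The function name, the HuggingFace-style `.logits` access, and the `keep_ratio` value are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def selective_lm_loss(model, ref_model, input_ids, keep_ratio=0.6):
    """Selective Language Modeling sketch: backpropagate cross-entropy
    only through the tokens with the highest excess loss.

    Assumes HuggingFace-style causal LMs whose outputs expose `.logits`;
    `keep_ratio` is an illustrative hyperparameter, not the paper's value.
    """
    labels = input_ids[:, 1:]                         # next-token targets
    logits = model(input_ids).logits[:, :-1]          # training-model predictions
    with torch.no_grad():                             # reference model is frozen
        ref_logits = ref_model(input_ids).logits[:, :-1]

    # Per-token cross-entropy for both models, shape (batch, seq_len - 1).
    ce = F.cross_entropy(logits.transpose(1, 2), labels, reduction="none")
    ref_ce = F.cross_entropy(ref_logits.transpose(1, 2), labels, reduction="none")

    # Excess loss: how much worse the training model is than the reference.
    excess = (ce - ref_ce).detach()

    # Keep the top `keep_ratio` fraction of tokens in the batch by excess loss.
    k = max(1, int(excess.numel() * keep_ratio))
    threshold = excess.flatten().topk(k).values.min()
    mask = (excess >= threshold).float()

    # Focused loss: average cross-entropy over the selected tokens only.
    return (ce * mask).sum() / mask.sum()
```

Per the paper, the reference model is trained on a small, high-quality corpus reflecting the desired distribution, so a high excess loss flags tokens the training model can still usefully learn from, while noisy or already-mastered tokens are skipped.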