Understanding Warmup-Stable-Decay Learning Rates: A River Valley Loss Landscape Perspective

October 7, 2024
  • Introduction
    Training language models currently requires fixing a compute budget in advance, because the typical cosine learning rate schedule depends on the total number of steps. In contrast, the Warmup-Stable-Decay (WSD) schedule uses a constant learning rate to produce a main branch of iterates that can, in principle, continue indefinitely without a pre-specified compute budget. Then, given any compute budget, one can branch out from the main branch at any time with a rapidly decaying learning rate to produce a strong model. Empirically, WSD yields an unconventional loss curve: the loss stays elevated during the stable phase but drops sharply during the decay phase. To explain this phenomenon, we conjecture that the pretraining loss exhibits a river valley landscape, resembling a deep valley with a river running along its bottom. Under this assumption, we show that during the stable phase the iterates undergo large oscillations due to the high learning rate, yet progress rapidly along the river. During the decay phase, the rapidly shrinking learning rate damps these oscillations, moving the iterates closer to the river and revealing the true optimization progress. The sustained high-learning-rate phase and the fast-decay phase are therefore responsible for progress in the river and mountain directions respectively, and both are critical. Our analysis predicts phenomena consistent with empirical observations and shows that such a landscape can emerge from pretraining on a simple bi-gram dataset. Inspired by this theory, we introduce WSD-S, a variant of WSD that reuses the decay phases of previous checkpoints and keeps only a single main branch, from which we resume at a decayed checkpoint. WSD-S obtains multiple language model checkpoints in a single run and outperforms WSD and Cyclic-Cosine across various compute budgets for models with 0.1B to 1.2B parameters.
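    To make the schedule concrete, below is a minimal sketch of a WSD-style learning rate as a standalone Python function. It is not the authors' code; the function name wsd_lr and all hyperparameter names and default values (warmup_steps, peak_lr, decay_start, decay_steps, min_lr) are illustrative assumptions.
      def wsd_lr(step, warmup_steps=500, peak_lr=3e-4,
                 decay_start=None, decay_steps=1000, min_lr=3e-5):
          """Learning rate at `step` under a Warmup-Stable-Decay schedule.

          Warmup: linear ramp from 0 to peak_lr.
          Stable: constant peak_lr; this "main branch" can continue indefinitely.
          Decay:  once a compute budget is chosen, branch out at `decay_start`
                  and decay rapidly (linearly here) to min_lr over `decay_steps`.
          """
          if step < warmup_steps:
              return peak_lr * step / max(1, warmup_steps)
          if decay_start is None or step < decay_start:
              return peak_lr  # stable phase: constant high learning rate
          progress = min(1.0, (step - decay_start) / max(1, decay_steps))
          return peak_lr + (min_lr - peak_lr) * progress
    For a run whose budget is later fixed at, say, 10,000 steps, one would branch out with decay_start=9000 and decay_steps=1000 while the main branch continues at the constant rate.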
  • Problem Addressed
    Conventional cosine learning-rate schedules depend on the total number of training steps, so the compute budget must be fixed before training starts. The paper asks how to obtain strong language model checkpoints at arbitrary, not pre-specified, compute budgets within a single open-ended run, and proposes WSD-S, a Warmup-Stable-Decay schedule variant for training large language models.
  • Key Idea
    The Warmup-Stable-Decay (WSD) schedule uses a constant learning rate to produce a main branch of iterates that can continue indefinitely without a pre-specified compute budget. Then, given any compute budget, one can branch out from the main branch at a proper time with a rapidly decaying learning rate to produce a strong model. WSD-S is a variant of WSD that reuses previous checkpoints' decay phases and keeps only one main branch, where we resume from a decayed checkpoint.
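    The sketch below shows what such a single-branch schedule might look like as a function of the step, assuming illustrative checkpoint budgets and decay lengths; the name wsd_s_lr and all values are assumptions for illustration, not the authors' code.
      def wsd_s_lr(step, warmup_steps=500, peak_lr=3e-4,
                   checkpoint_steps=(10_000, 20_000, 40_000),
                   decay_steps=1000, min_lr=3e-5):
          """Learning rate at `step` under a WSD-S-style schedule.

          Unlike WSD, there is only one trajectory: before each budget in
          `checkpoint_steps` the rate decays rapidly to min_lr, a checkpoint is
          saved, and training resumes from that decayed checkpoint at peak_lr,
          so the decay phases are reused rather than discarded.
          """
          if step < warmup_steps:
              return peak_lr * step / max(1, warmup_steps)
          for budget in checkpoint_steps:
              start = budget - decay_steps
              if start <= step < budget:
                  progress = (step - start) / max(1, decay_steps)
                  return peak_lr + (min_lr - peak_lr) * progress
          return peak_lr  # stable phase between decays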
  • Other Highlights
    The paper proposes a training schedule for large language models that can continue indefinitely without a pre-specified compute budget. The resulting schedule, WSD-S, outperforms WSD and Cyclic-Cosine at producing multiple language model checkpoints across various compute budgets in a single run, for models ranging from 0.1B to 1.2B parameters. The paper also introduces the river valley picture of the pretraining loss landscape and argues that the sustained high-learning-rate phase and the fast-decay phase are both critical for optimization progress. This analysis is additionally supported by pretraining experiments on a simple bi-gram dataset, and the code is open-sourced.
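    To give intuition for the river valley picture, here is a toy two-dimensional loss with a steep "mountain" direction and a slowly improving "river" direction; the functional form and all constants are purely illustrative assumptions and are not taken from the paper.
      import numpy as np

      # Toy river-valley loss (illustrative assumption, not the paper's model):
      # y is the steep mountain direction, x is the slowly improving river direction.
      def river_valley_loss(x, y):
          return 50.0 * y ** 2 + 2.0 / (1.0 + 0.5 * x)

      def grad(x, y):
          dx = -1.0 / (1.0 + 0.5 * x) ** 2   # gentle slope along the river
          dy = 100.0 * y                     # steep walls across the valley
          return np.array([dx, dy])

      pos = np.array([0.0, 1.0])
      for step in range(200):
          # Stable phase: large constant LR -> big oscillations in y, steady drift in x.
          # Decay phase: small LR -> oscillations collapse, revealing the progress in x.
          lr = 0.0199 if step < 150 else 0.001
          pos = pos - lr * grad(*pos)
          if step in (0, 149, 199):
              print(step, pos, river_valley_loss(*pos))
    Running this, the loss stays high during the large-learning-rate phase (the oscillating y term dominates) and drops sharply once the learning rate is reduced, mirroring the WSD loss curve described above.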
  • Related Work
    Some related works in this field include 'Attention Is All You Need' by Vaswani et al., 'BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding' by Devlin et al., and 'Language Models are Unsupervised Multitask Learners' (GPT-2) by Radford et al.