- Abstract: We propose Adam-mini, an optimizer with a 45% to 50% smaller memory footprint than AdamW and on-par or better performance. Adam-mini cuts memory by reducing the learning-rate resources in Adam (i.e., $1/\sqrt{v}$). We find that $\geq 90\%$ of these learning rates in $v$ can be harmlessly removed if we (1) carefully partition the parameters into blocks following our proposed principle on Hessian structure, and (2) assign a single but good learning rate to each parameter block. We further find that, for each of these parameter blocks, there exists a single high-quality learning rate that can outperform Adam, provided sufficient resources are available to search for it. We then provide a cost-effective way to find good learning rates and propose Adam-mini. Empirically, we verify that Adam-mini performs on par with or better than AdamW on various language models ranging from 125M to 7B parameters for pre-training, supervised fine-tuning, and RLHF. The reduced memory footprint of Adam-mini also alleviates communication overhead among GPUs and CPUs, thereby increasing throughput. For example, Adam-mini achieves 49.6% higher throughput than AdamW when pre-training Llama2-7B on $2\times$ A800-80GB GPUs, saving 33% of pre-training wall-clock time. (A minimal sketch of such a block-wise update is given after this list.)
- Problem addressed: Reducing the memory footprint of the Adam optimizer without compromising performance, as summarized by the paper title "Adam-mini: Reducing Memory Footprint of Adam without Compromising Performance".
- Key idea: Reduce the memory footprint of Adam by removing most of the per-coordinate learning-rate resources in $v$ and assigning a single but good learning rate to each parameter block; such a block-wise learning rate can even outperform Adam if sufficient resources are available to search for it (see the memory-accounting sketch after this list).
- Other highlights: Experiments show that Adam-mini performs on par with or better than AdamW on various language models for pre-training, supervised fine-tuning, and RLHF. The reduced memory footprint of Adam-mini also increases throughput and saves wall-clock time for pre-training. The paper additionally provides a cost-effective way to find good learning rates, and the code is open-sourced on GitHub.
- Related works: the AdamW optimizer, an improved variant of Adam, and other memory-efficient optimizers such as LAMB and Adafactor.
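The abstract above describes replacing Adam's per-coordinate $1/\sqrt{v}$ with one learning rate per parameter block. The following is a minimal Python sketch of what such a block-wise update could look like; the function name `adam_mini_block_step`, the argument layout, and the choice of the block mean of squared gradients as the shared second moment are illustrative assumptions for this sketch, not the paper's reference implementation (the official code is on GitHub).

```python
import math
import torch

def adam_mini_block_step(param, grad, m, v_scalar, lr, step,
                         beta1=0.9, beta2=0.999, eps=1e-8):
    # Illustrative block-wise Adam-style step (not the paper's official code).
    # `param`, `grad`, `m` are tensors of the same shape (one parameter block);
    # `v_scalar` is a single Python float shared by the whole block.
    # First moment: kept per coordinate, exactly as in Adam.
    m.mul_(beta1).add_(grad, alpha=1 - beta1)
    # Second moment: collapsed to one scalar per block, here an EMA of the
    # block mean of squared gradients (an assumed choice for this sketch).
    v_scalar = beta2 * v_scalar + (1 - beta2) * grad.pow(2).mean().item()
    # Standard Adam-style bias correction.
    m_hat = m / (1 - beta1 ** step)
    v_hat = v_scalar / (1 - beta2 ** step)
    # Every coordinate in the block shares the single step size lr / sqrt(v_hat).
    param.add_(m_hat, alpha=-lr / (math.sqrt(v_hat) + eps))
    return v_scalar

# Toy usage on one block (the step counter starts at 1).
w = torch.randn(256, 256)
g = torch.randn_like(w)
m_state, v_state = torch.zeros_like(w), 0.0
v_state = adam_mini_block_step(w, g, m_state, v_state, lr=1e-3, step=1)
```

Storing one scalar per block instead of a full tensor for $v$ is what removes the bulk of the second-moment state.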
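The 45% to 50% memory saving claimed above can be checked with back-of-the-envelope arithmetic: Adam/AdamW store per-coordinate $m$ and $v$ (roughly $2N$ values for $N$ parameters), while a block-wise variant stores the full $m$ plus one $v$ entry per block (roughly $N$ values). The helper below and its block sizes are hypothetical, used only to make that accounting concrete.

```python
def optimizer_state_counts(block_sizes):
    # Rough accounting in number of stored scalars, ignoring bookkeeping such
    # as step counters. Adam/AdamW keep per-coordinate m and v; a block-wise
    # variant keeps per-coordinate m but only one v entry per block.
    n_params = sum(block_sizes)
    adam = 2 * n_params
    blockwise = n_params + len(block_sizes)
    return adam, blockwise

# Hypothetical partition: 32 blocks, each a 4096 x 4096 weight matrix.
adam, mini = optimizer_state_counts([4096 * 4096] * 32)
print(f"optimizer-state saving: {1 - mini / adam:.1%}")  # ~50%, in line with the 45-50% claim
```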