- Abstract: We propose Adam-mini, an optimizer with a 45% to 50% smaller memory footprint than AdamW and on-par or better performance. Adam-mini cuts memory by reducing the learning-rate resources in Adam (i.e., $1/\sqrt{v}$). We find that $\geq 90\%$ of these learning rates in $v$ can be harmlessly removed if we (1) carefully partition the parameters into blocks following our proposed principle on Hessian structure, and (2) assign a single but good learning rate to each parameter block. We further find that, for each of these parameter blocks, there exists a single high-quality learning rate that can outperform Adam, provided sufficient resources are available to search for it. We then provide a cost-effective way to find good learning rates and propose Adam-mini. Empirically, we verify that Adam-mini performs on par with or better than AdamW on various language models ranging from 125M to 7B parameters for pre-training, supervised fine-tuning, and RLHF. The reduced memory footprint of Adam-mini also alleviates communication overhead among GPUs and CPUs, thereby increasing throughput. For example, Adam-mini achieves 49.6% higher throughput than AdamW when pre-training Llama2-7B on $2\times$ A800-80GB GPUs, saving 33% of pre-training wall-clock time. (A minimal sketch of such a block-wise update is given after this list.)
- Problem addressed: Reducing the memory footprint of the Adam optimizer without compromising performance, as summarized by the paper title "Adam-mini: Reducing Memory Footprint of Adam without Compromising Performance".
- Key idea: Reduce the memory footprint of Adam by removing most of the per-coordinate learning-rate resources in $v$ and assigning a single but good learning rate to each parameter block; such a block-wise learning rate can even outperform Adam if sufficient resources are available to search for it (see the memory-accounting sketch after this list).
- Other highlights: Experiments show that Adam-mini performs on par with or better than AdamW on various language models for pre-training, supervised fine-tuning, and RLHF. The reduced memory footprint of Adam-mini also increases throughput and saves wall-clock time for pre-training. The paper additionally provides a cost-effective way to find good learning rates, and the code is open-sourced on GitHub.
- Related works: the AdamW optimizer, an improved variant of Adam, and other memory-efficient optimizers such as LAMB and Adafactor.
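The abstract above describes replacing Adam's per-coordinate $1/\sqrt{v}$ with one learning rate per parameter block. The following is a minimal Python sketch of what such a block-wise update could look like; the function name `adam_mini_block_step`, the argument layout, and the choice of the block mean of squared gradients as the shared second moment are illustrative assumptions for this sketch, not the paper's reference implementation (the official code is on GitHub).

```python
import math
import torch

def adam_mini_block_step(param, grad, m, v_scalar, lr, step,
                         beta1=0.9, beta2=0.999, eps=1e-8):
    # Illustrative block-wise Adam-style step (not the paper's official code).
    # `param`, `grad`, `m` are tensors of the same shape (one parameter block);
    # `v_scalar` is a single Python float shared by the whole block.
    # First moment: kept per coordinate, exactly as in Adam.
    m.mul_(beta1).add_(grad, alpha=1 - beta1)
    # Second moment: collapsed to one scalar per block, here an EMA of the
    # block mean of squared gradients (an assumed choice for this sketch).
    v_scalar = beta2 * v_scalar + (1 - beta2) * grad.pow(2).mean().item()
    # Standard Adam-style bias correction.
    m_hat = m / (1 - beta1 ** step)
    v_hat = v_scalar / (1 - beta2 ** step)
    # Every coordinate in the block shares the single step size lr / sqrt(v_hat).
    param.add_(m_hat, alpha=-lr / (math.sqrt(v_hat) + eps))
    return v_scalar

# Toy usage on one block (the step counter starts at 1).
w = torch.randn(256, 256)
g = torch.randn_like(w)
m_state, v_state = torch.zeros_like(w), 0.0
v_state = adam_mini_block_step(w, g, m_state, v_state, lr=1e-3, step=1)
```

Storing one scalar per block instead of a full tensor for $v$ is what removes the bulk of the second-moment state.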
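The 45% to 50% memory saving claimed above can be checked with back-of-the-envelope arithmetic: Adam/AdamW store per-coordinate $m$ and $v$ (roughly $2N$ values for $N$ parameters), while a block-wise variant stores the full $m$ plus one $v$ entry per block (roughly $N$ values). The helper below and its block sizes are hypothetical, used only to make that accounting concrete.

```python
def optimizer_state_counts(block_sizes):
    # Rough accounting in number of stored scalars, ignoring bookkeeping such
    # as step counters. Adam/AdamW keep per-coordinate m and v; a block-wise
    # variant keeps per-coordinate m but only one v entry per block.
    n_params = sum(block_sizes)
    adam = 2 * n_params
    blockwise = n_params + len(block_sizes)
    return adam, blockwise

# Hypothetical partition: 32 blocks, each a 4096 x 4096 weight matrix.
adam, mini = optimizer_state_counts([4096 * 4096] * 32)
print(f"optimizer-state saving: {1 - mini / adam:.1%}")  # ~50%, in line with the 45-50% claim
```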