SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient

M Ryabinin, T Dettmers, M Diskin, A Borzunov
[HSE University & Yandex & University of Washington]

SWARM Parallelism: A Decentralized Model-Parallel Algorithm for Training Large Models

Key points:

  1. Analyze existing model-parallel training techniques and derive a "square-cube law" of distributed training (illustrated by the back-of-envelope sketch after this list);
  2. Develop SWARM parallelism, a decentralized model-parallel algorithm built on randomized, fault-tolerant pipelines and dynamic rebalancing;
  3. Combine SWARM parallelism with compression strategies to train a large Transformer language model over low network bandwidth.
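
The "square-cube" intuition behind point 1 can be shown with a back-of-envelope calculation. This is a minimal sketch with illustrative numbers (not figures from the paper), assuming a single dense h×h layer and a microbatch of b token vectors: forward compute grows roughly as 2·b·h² FLOPs, while the activations handed to the next pipeline stage are only b·h values, so the compute-to-communication ratio improves roughly linearly as the model gets wider.

```python
# Back-of-envelope: why wider layers are relatively cheaper to communicate.
# Assumptions (illustrative only): one dense layer of shape (h, h),
# a microbatch of b token vectors, fp16 activations (2 bytes per value).

def compute_flops(b: int, h: int) -> float:
    """Approximate forward FLOPs of one dense layer: 2 * b * h * h."""
    return 2.0 * b * h * h

def activation_bytes(b: int, h: int, bytes_per_value: int = 2) -> float:
    """Bytes sent to the next pipeline stage: b * h activation values."""
    return float(b * h * bytes_per_value)

if __name__ == "__main__":
    b = 2048  # tokens in a microbatch
    for h in (1024, 4096, 16384):
        ratio = compute_flops(b, h) / activation_bytes(b, h)
        print(f"h={h:6d}  FLOPs per byte sent = {ratio:,.0f}")
    # The ratio grows linearly with h: larger (wider) models spend
    # proportionally more time computing than communicating, which is
    # what makes pipeline training over slow links feasible.
```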

One-sentence summary:
SWARM parallelism is proposed: a decentralized model-parallel algorithm for training large models on unreliable, low-bandwidth devices, using randomized fault-tolerant pipelines and dynamic rebalancing to improve communication efficiency.

Abstract:
Many deep learning applications benefit from using large models with billions of parameters. Training these models is notoriously expensive due to the need for specialized HPC clusters. In this work, we consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions. We analyze the performance of existing model-parallel algorithms in these conditions and find configurations where training larger models becomes less communication-intensive. Based on these findings, we propose SWARM parallelism, a model-parallel training algorithm designed for poorly connected, heterogeneous and unreliable devices. SWARM creates temporary randomized pipelines between nodes that are rebalanced in case of failure. We empirically validate our findings and compare SWARM parallelism with existing large-scale training approaches. Finally, we combine our insights with compression strategies to train a large Transformer language model with 1B shared parameters (approximately 13B before sharing) on preemptible T4 GPUs with less than 200Mb/s network.

Paper: https://arxiv.org/abs/2301.11913
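
As a rough illustration of the "temporary randomized pipelines" and rebalancing ideas from the abstract, the toy simulation below routes each microbatch through one randomly chosen live worker per pipeline stage, skips workers that have been preempted, and occasionally moves a worker from the least-loaded stage to the most-loaded one. All class and function names are hypothetical; this is a sketch of the idea, not the authors' implementation.

```python
# Toy sketch of SWARM-style routing and rebalancing (hypothetical names,
# not the paper's code): each stage holds a pool of interchangeable workers.
import random

class Worker:
    def __init__(self, name: str) -> None:
        self.name = name
        self.alive = True
        self.processed = 0

def route_microbatch(stages: list[list[Worker]]) -> list[str]:
    """Send a microbatch through one randomly chosen live worker per stage."""
    path = []
    for pool in stages:
        candidates = [w for w in pool if w.alive]   # skip failed/preempted peers
        if not candidates:
            raise RuntimeError("no live workers left in this stage")
        worker = random.choice(candidates)          # randomized "temporary pipeline" link
        worker.processed += 1
        path.append(worker.name)
    return path

def rebalance(stages: list[list[Worker]]) -> None:
    """Move one worker from the least-loaded stage to the most-loaded one."""
    load = [sum(w.processed for w in pool) / max(len(pool), 1) for pool in stages]
    src, dst = load.index(min(load)), load.index(max(load))
    if src != dst and len(stages[src]) > 1:
        stages[dst].append(stages[src].pop())

if __name__ == "__main__":
    stages = [[Worker(f"s{i}w{j}") for j in range(3)] for i in range(4)]
    for step in range(100):
        if step == 30:                       # simulate a preempted instance
            stages[1][0].alive = False
        route_microbatch(stages)
        if step % 25 == 24:
            rebalance(stages)
    print([len(pool) for pool in stages])    # stage pool sizes after rebalancing
```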
