FastPersist: Accelerating Model Checkpointing in Deep Learning

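June 19, 2024
  • Summary
    Model checkpoints are critical deep learning (DL) artifacts, enabling fault tolerance for training and downstream applications such as inference. However, writing checkpoints to persistent storage, along with other I/O aspects of DL training, is largely ignored by the compute-focused optimization efforts that aim to speed up training of rapidly growing models and datasets. To address this, we propose FastPersist to accelerate checkpoint creation in DL training. FastPersist combines three novel techniques: (i) NVMe optimizations for faster checkpoint writes to SSDs, (ii) efficient write parallelism using the SSDs available in the training environment, and (iii) overlapping checkpointing with independent training computations. Our evaluation on real-world dense and sparse DL models shows that FastPersist creates checkpoints in persistent storage up to 116x faster than the baseline and enables per-iteration checkpointing with negligible overhead.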
  • Problem addressed
    FastPersist: Accelerating Checkpointing in Deep Learning Training
  • Key idea
    FastPersist combines three novel techniques to accelerate checkpoint creation in DL training: NVMe optimizations, efficient write parallelism, and overlapping checkpointing with independent training computations.
  • Other highlights
    FastPersist creates checkpoints in persistent storage up to 116x faster than the baseline and enables per-iteration checkpointing with negligible overhead. The evaluation uses real-world dense and sparse DL models.
  • Related work
    Related work includes optimization efforts for faster DL training, but these mostly ignore I/O aspects such as checkpointing. No specific related papers are mentioned.
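
The third technique in the key idea, overlapping checkpoint writes with training computation, can be illustrated with a minimal Python sketch. This is not FastPersist's implementation: it is a generic background-thread pattern using only the standard library, and the names `write_checkpoint` and `train`, the toy "model" (a list of floats), and the choice to copy the state into a snapshot before writing are all assumptions made for illustration. The NVMe-specific optimizations and multi-SSD write parallelism described in the summary are not modeled here.

```python
import os
import pickle
import tempfile
import threading


def write_checkpoint(state, path):
    """Serialize `state` to `path` and force it onto persistent storage."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes reach the device
    os.replace(tmp_path, path)  # atomic rename: readers never see a partial file


def train(num_iters, ckpt_dir):
    """Toy training loop that checkpoints every iteration in the background."""
    state = {"step": 0, "weights": [0.0] * 1000}
    pending = None  # the in-flight checkpoint write, if any
    for step in range(1, num_iters + 1):
        # Stand-in for forward/backward pass and optimizer step (hypothetical).
        state = {"step": step, "weights": [w + 1.0 for w in state["weights"]]}

        # Allow at most one write in flight: wait for the previous one.
        if pending is not None:
            pending.join()

        # Copy the state so the next iteration cannot mutate what is being written.
        snapshot = {"step": state["step"], "weights": list(state["weights"])}
        path = os.path.join(ckpt_dir, f"ckpt_{step}.pkl")
        pending = threading.Thread(target=write_checkpoint, args=(snapshot, path))
        pending.start()  # the write now overlaps with the next iteration's compute

    if pending is not None:
        pending.join()  # flush the final checkpoint before returning
    return state


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as ckpt_dir:
        final_state = train(3, ckpt_dir)
        print(final_state["step"], sorted(os.listdir(ckpt_dir)))
```

In this sketch the write for one iteration proceeds while the next iteration computes, and the loop only blocks if the previous write has not finished by the time a new checkpoint is ready. Whether per-iteration checkpointing stays cheap in practice depends on the write path being fast enough to finish within one iteration, which is what the first two techniques in the summary target.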