FastPersist: Accelerating Model Checkpointing in Deep Learning

Overview

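Model checkpoints are critical deep learning (DL) artifacts that provide fault tolerance for training and for downstream applications such as inference. However, writing checkpoints to persistent storage, like other I/O aspects of DL training, is mostly ignored by compute-focused optimization efforts aimed at faster training of rapidly growing models and datasets. To address this imbalance, we propose FastPersist to accelerate checkpoint creation in DL training. FastPersist combines three novel techniques: (i) NVMe optimizations for faster checkpoint writes to SSDs, (ii) efficient write parallelism using the SSDs available in training environments, and (iii) overlapping checkpointing with independent training computations. Our evaluation using real-world dense and sparse DL models shows that FastPersist creates checkpoints in persistent storage up to 116x faster than the baseline, and enables per-iteration checkpointing with negligible overhead.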
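To make technique (ii) concrete, the following is a minimal sketch of write parallelism, not FastPersist's actual implementation. It assumes the checkpoint has already been serialized to a byte string and that each shard file stands in for a separate SSD; the names `write_shard`, `parallel_write`, and `read_back` are hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def write_shard(path, data):
    """Write one shard and force it to stable storage before returning."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    return len(data)


def parallel_write(blob, directory, num_writers):
    """Split `blob` into contiguous shards and write them concurrently.

    Assumption: each shard path could live on a different SSD; here they
    all share one directory for simplicity.
    """
    shard_size = -(-len(blob) // num_writers)  # ceiling division
    paths, futures = [], []
    with ThreadPoolExecutor(max_workers=num_writers) as pool:
        for i in range(num_writers):
            chunk = blob[i * shard_size:(i + 1) * shard_size]
            path = os.path.join(directory, f"shard_{i:03d}.bin")
            paths.append(path)
            futures.append(pool.submit(write_shard, path, chunk))
        written = sum(f.result() for f in futures)
    return paths, written


def read_back(paths):
    """Reassemble the checkpoint by concatenating shards in order."""
    parts = []
    for p in paths:
        with open(p, "rb") as f:
            parts.append(f.read())
    return b"".join(parts)
```

Because every writer owns a disjoint byte range, no coordination is needed between writers beyond waiting for all of them to finish; loading the checkpoint is a concatenation of the shards in index order.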
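Problem addressed

FastPersist targets slow checkpoint creation in DL training: writing checkpoints to persistent storage, like other I/O aspects of training, has been largely ignored by compute-focused optimization efforts.
Key idea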

FastPersist combines three novel techniques to accelerate checkpoint creation in DL training: NVMe optimizations, efficient write parallelism, and overlapping checkpointing with independent training computations.
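The third technique, overlapping checkpointing with independent computations, can be illustrated with a toy training loop. This is a sketch under stated assumptions rather than the authors' code: it assumes that "independent" computations are those that only read the model state (here, the gradient computation), so the write of the previous checkpoint may proceed in a background thread until the next in-place weight update. The functions `persist`, `compute_gradient`, and `train` are hypothetical.

```python
import os
import pickle
import threading


def persist(state, path):
    """Serialize `state` to a temporary file, then atomically rename it."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def compute_gradient(weights):
    """Toy gradient of 0.5 * sum(w^2); it only reads the weights."""
    return list(weights)


def train(num_iters, ckpt_dir, lr=0.1):
    weights = [1.0, 2.0, 3.0]
    writer = None
    for step in range(num_iters):
        # Read-only work: safe to run while the previous checkpoint,
        # which references the live `weights` list, is still being written.
        grads = compute_gradient(weights)

        # The update mutates `weights` in place, so the in-flight write
        # must complete first.
        if writer is not None:
            writer.join()
        for i, g in enumerate(grads):
            weights[i] -= lr * g

        # Start writing this iteration's checkpoint without copying the
        # weights; the write overlaps with the next gradient computation.
        path = os.path.join(ckpt_dir, f"ckpt_{step}.pkl")
        state = {"step": step, "weights": weights}
        writer = threading.Thread(target=persist, args=(state, path))
        writer.start()

    if writer is not None:
        writer.join()
    return weights
```

The single `join()` before the update is the only synchronization point. If the write finishes within the time of the read-only phase, the checkpoint adds essentially no stall to the iteration, which is the condition under which per-iteration checkpointing becomes cheap in this sketch.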
Other highlights

FastPersist creates checkpoints in persistent storage up to 116x faster than the baseline, and enables per-iteration checkpointing with negligible overhead. The evaluation used real-world dense and sparse DL models.
Related work

Related work consists of optimization efforts for faster DL training, which mostly ignore I/O aspects such as checkpointing. No specific related research papers were mentioned.
