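- Abstract: Model checkpoints are critical deep learning (DL) artifacts that enable fault tolerance for training and for downstream applications such as inference. However, writing checkpoints to persistent storage, along with other I/O aspects of DL training, is mostly ignored by compute-focused optimization efforts aimed at speeding up training for rapidly growing models and datasets. To address this imbalance, we propose FastPersist to accelerate checkpoint creation in DL training. FastPersist combines three novel techniques: (i) NVMe optimizations for faster checkpoint writes to SSDs, (ii) efficient write parallelism using the SSDs available in training environments, and (iii) overlapping checkpointing with independent training computations. Our evaluation using real-world dense and sparse DL models shows that FastPersist creates checkpoints in persistent storage up to 116x faster than the baseline, and enables per-iteration checkpointing with negligible overhead.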
- Problem addressed: FastPersist: Accelerating Checkpointing in Deep Learning Training
- Key idea: FastPersist combines three novel techniques to accelerate checkpoint creation in DL training: NVMe optimizations, efficient write parallelism, and overlapping checkpointing with independent training computations.
- Other highlights: FastPersist creates checkpoints in persistent storage up to 116x faster than the baseline, and enables per-iteration checkpointing with negligible overhead. Real-world dense and sparse DL models were used for evaluation.
- Related work: Existing optimization efforts target faster DL training, but they mostly ignore I/O aspects such as checkpointing. No specific related research papers were mentioned.
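
The third technique in the summary above, overlapping checkpoint writes with training computation, can be illustrated with a minimal Python sketch. This is not FastPersist's implementation: the `AsyncCheckpointer` class, the `train` loop, the `ckpt_<step>.pkl` file naming, and the use of `pickle` with a background thread are all assumptions made for illustration, and the sketch does not cover the NVMe optimizations or multi-SSD write parallelism.

```python
import os
import pickle
import tempfile
import threading


class AsyncCheckpointer:
    """Hypothetical helper: persists snapshots on a background thread."""

    def __init__(self, directory):
        self.directory = directory
        self._thread = None

    def save(self, state, step):
        # Allow at most one write in flight: wait for the previous one.
        self.wait()
        # Serialize in the caller's thread so that later updates to
        # `state` cannot leak into this snapshot.
        payload = pickle.dumps(state)
        path = os.path.join(self.directory, f"ckpt_{step}.pkl")
        self._thread = threading.Thread(target=self._write, args=(payload, path))
        self._thread.start()
        return path

    @staticmethod
    def _write(payload, path):
        # Write to a temporary file, flush to disk, then rename atomically
        # so a crash never leaves a partially written checkpoint at `path`.
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)

    def wait(self):
        if self._thread is not None:
            self._thread.join()
            self._thread = None


def train(num_steps, directory):
    # Toy state standing in for model and optimizer parameters.
    state = {"step": 0, "weights": [0.0, 0.0]}
    ckpt = AsyncCheckpointer(directory)
    paths = []
    for step in range(1, num_steps + 1):
        # Stand-in for the forward/backward pass and optimizer step.
        state["weights"] = [w + 0.1 for w in state["weights"]]
        state["step"] = step
        # The disk write for this step overlaps with the next step's compute.
        paths.append(ckpt.save(state, step))
    ckpt.wait()  # make sure the final checkpoint is durable
    return paths
```

Calling `train(3, tempfile.mkdtemp())` writes one checkpoint per iteration. The training loop blocks only while the state is serialized and, if the previous write is still running, until that write completes.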