
Question about Checkpoint Manager's high-performance checkpoint (CKPT) read/write #165

Open
yiqing0071 opened this issue Mar 13, 2025 · 3 comments

Comments

@yiqing0071

Quoting the Fire-Flyer AI-HPC paper on the Checkpoint Manager:

> During the saving process, each tensor is recorded with its index and the offset within the checkpoint, which makes locating tensors more convenient during the loading process. With the 3FS batch read API, a loading process can be completed in just a few seconds.
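The index-and-offset scheme in that quote can be sketched roughly as follows. This is a minimal plain-Python illustration with hypothetical function names and a JSON sidecar for metadata; it is not the actual Checkpoint Manager or the 3FS batch read API, and `tensors` here maps names to raw serialized bytes:

```python
import json

def save_checkpoint(tensors, data_path, meta_path):
    """Write raw tensor bytes back-to-back; record each tensor's
    index and byte offset in a sidecar metadata file.

    `tensors` maps a name to already-serialized bytes -- a stand-in
    for real tensor data; the actual on-disk layout is an assumption.
    """
    meta = []
    offset = 0
    with open(data_path, "wb") as f:
        for idx, (name, raw) in enumerate(tensors.items()):
            meta.append({"index": idx, "name": name,
                         "offset": offset, "nbytes": len(raw)})
            f.write(raw)
            offset += len(raw)
    with open(meta_path, "w") as f:
        json.dump(meta, f)

def load_tensor(entry, data_path):
    """Seek straight to the recorded offset; no scan of the whole file."""
    with open(data_path, "rb") as f:
        f.seek(entry["offset"])
        return f.read(entry["nbytes"])
```

With the offsets known up front, all the per-tensor reads can be issued in one batch rather than one `seek`/`read` at a time, which is presumably what the 3FS batch read API provides.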

I'd like to ask: beyond reading the checkpoint's chunks out in batches and aggregating them on the compute nodes, are there any other tricks behind this seconds-level load time?
Also, for this seconds-level figure, what were the corresponding parameter count and per-node bandwidth?

@wangyibin-gh

Let me take a guess. The checkpoint file is quite large, so it is stored in multiple chunks, and each chunk is replicated on different storage targets. During AI training, when a failure happens, each GPU needs to read only its own part of the tensors/params within the checkpoint file. Since the tensor index/offset can be easily calculated, and the checkpoint file metadata shows where the desired tensor data is stored, the loading process knows exactly where to read. It reads just the part it is interested in rather than the whole file, a sip of water out of the ocean, so it should definitely be fast rather than slow.
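That "read only your own part" idea can be sketched like this, assuming a hypothetical round-robin assignment of tensors to ranks and per-tensor metadata entries of the form `{"index", "offset", "nbytes"}` (the paper does not describe the actual sharding):

```python
def ranges_for_rank(meta, rank, world_size):
    """Return the (offset, nbytes) byte ranges this rank needs to read.

    Round-robin by tensor index is an assumption for illustration;
    any deterministic mapping from tensors to ranks would work.
    """
    return [(e["offset"], e["nbytes"])
            for e in meta if e["index"] % world_size == rank]

def fraction_of_file(meta, rank, world_size):
    """How much of the checkpoint this rank actually touches."""
    mine = sum(n for _, n in ranges_for_rank(meta, rank, world_size))
    total = sum(e["nbytes"] for e in meta)
    return mine / total
```

With many ranks, each one touches only a small fraction of the file, which is the "sip of water out of the ocean" effect.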

@yiqing0071
Author


This is reasonable, but it doesn't really cover the scenario where a whole checkpoint file needs to be loaded, say, when a hardware failure happens and training has to restart from some point.
From the paper, it seems the whole file is loaded within seconds, isn't it?

@wangyibin-gh


If there is such a case: the client host has a 400 Gbps NIC, which can theoretically deliver more than 40 GB/s of aggregate bandwidth to the 3FS storage servers. At that rate it is fairly easy to read a checkpoint file within seconds.
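The arithmetic behind that claim, with a made-up shard size for illustration (the paper gives no per-node checkpoint size here):

```python
# Back-of-the-envelope only; the shard size below is a hypothetical
# example, not a figure from the paper.
nic_gbps = 400                        # per-host NIC line rate, Gbit/s
line_rate_gbytes = nic_gbps / 8       # theoretical ceiling in GB/s
achievable_gbytes = 40                # aggregate bandwidth cited above
shard_gb = 100                        # hypothetical per-node checkpoint shard
load_seconds = shard_gb / achievable_gbytes
print(line_rate_gbytes, load_seconds)  # prints: 50.0 2.5
```

So even a sizable per-node shard stays in the low single-digit seconds as long as the reads keep the NIC saturated.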
