
Question about Checkpoint Manager's high-performance checkpoint (CKPT) read/write #165

Open
yiqing0071 opened this issue Mar 13, 2025 · 3 comments

Comments

@yiqing0071

Quoting the Fire-Flyer AI-HPC paper on the Checkpoint Manager:

> During the saving process, each tensor is recorded with its index and the offset within the checkpoint, which makes locating tensors more convenient during the loading process. With the 3FS batch read API, a loading process can be completed in just a few seconds.
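The index-and-offset scheme in that quote can be sketched roughly as follows. This is a minimal plain-Python illustration with hypothetical function names and a JSON sidecar for metadata; it is not the actual Checkpoint Manager or the 3FS batch read API, and `tensors` here maps names to raw serialized bytes:

```python
import json

def save_checkpoint(tensors, data_path, meta_path):
    """Write raw tensor bytes back-to-back; record each tensor's
    index and byte offset in a sidecar metadata file.

    `tensors` maps a name to already-serialized bytes -- a stand-in
    for real tensor data; the actual on-disk layout is an assumption.
    """
    meta = []
    offset = 0
    with open(data_path, "wb") as f:
        for idx, (name, raw) in enumerate(tensors.items()):
            meta.append({"index": idx, "name": name,
                         "offset": offset, "nbytes": len(raw)})
            f.write(raw)
            offset += len(raw)
    with open(meta_path, "w") as f:
        json.dump(meta, f)

def load_tensor(entry, data_path):
    """Seek straight to the recorded offset; no scan of the whole file."""
    with open(data_path, "rb") as f:
        f.seek(entry["offset"])
        return f.read(entry["nbytes"])
```

With the offsets known up front, all the per-tensor reads can be issued in one batch rather than one `seek`/`read` at a time, which is presumably what the 3FS batch read API provides.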

I'd like to ask: beyond reading the checkpoint's chunks out in batches and aggregating them on the compute nodes, are there any other tricks behind this seconds-level load time?
Also, for this seconds-level figure, what were the corresponding parameter count and per-node bandwidth?

@wangyibin-gh

Let me take a guess. The checkpoint file is quite large, so it is stored in multiple chunks, and each chunk is replicated on different storage targets. During AI training, when a failure happens, each GPU needs to read only its own part of the tensors/params within the checkpoint file. Since the tensor index/offset can be easily calculated, and the checkpoint file metadata shows where the desired tensor data is stored, the loading process knows exactly where to read. It reads just the part it is interested in rather than the whole file, a sip of water out of the ocean, so it should definitely be fast rather than slow.
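That "read only your own part" idea can be sketched like this, assuming a hypothetical round-robin assignment of tensors to ranks and per-tensor metadata entries of the form `{"index", "offset", "nbytes"}` (the paper does not describe the actual sharding):

```python
def ranges_for_rank(meta, rank, world_size):
    """Return the (offset, nbytes) byte ranges this rank needs to read.

    Round-robin by tensor index is an assumption for illustration;
    any deterministic mapping from tensors to ranks would work.
    """
    return [(e["offset"], e["nbytes"])
            for e in meta if e["index"] % world_size == rank]

def fraction_of_file(meta, rank, world_size):
    """How much of the checkpoint this rank actually touches."""
    mine = sum(n for _, n in ranges_for_rank(meta, rank, world_size))
    total = sum(e["nbytes"] for e in meta)
    return mine / total
```

With many ranks, each one touches only a small fraction of the file, which is the "sip of water out of the ocean" effect.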

@yiqing0071
Author


This is reasonable, but it doesn't really cover the scenario where a whole checkpoint file needs to be loaded, say, when a hardware failure happens and training has to restart from some point.
From the paper, it seems the whole file is loaded within seconds, isn't it?

@wangyibin-gh


If there is such a case: the client host has a 400 Gbps NIC, which can theoretically deliver more than 40 GB/s of aggregate bandwidth to the 3FS storage servers. At that rate it is fairly easy to read a checkpoint file within seconds.
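The arithmetic behind that claim, with a made-up shard size for illustration (the paper gives no per-node checkpoint size here):

```python
# Back-of-the-envelope only; the shard size below is a hypothetical
# example, not a figure from the paper.
nic_gbps = 400                        # per-host NIC line rate, Gbit/s
line_rate_gbytes = nic_gbps / 8       # theoretical ceiling in GB/s
achievable_gbytes = 40                # aggregate bandwidth cited above
shard_gb = 100                        # hypothetical per-node checkpoint shard
load_seconds = shard_gb / achievable_gbytes
print(line_rate_gbytes, load_seconds)  # prints: 50.0 2.5
```

So even a sizable per-node shard stays in the low single-digit seconds as long as the reads keep the NIC saturated.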
