Multi-GPU Model Loading Issues - Issue & Code Fix #36
It takes 10-20 min each to load torch checkpoints, the text encoder, etc. when using 2 GPUs, and the time grows with more GPUs. With a single GPU the same loads only take a couple of minutes. I suspect it's a contention issue where all GPUs are trying to read the model files at the same time.

Example log: look at the lines for "Loading torch model" (10 minutes) and "Loading text encoder model" (20 minutes).
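One way to check the contention hypothesis is to time the raw checkpoint read on every rank: if the per-rank time grows with the number of GPUs while the single-GPU time stays flat, the disk reads are serializing on each other. This is a sketch, not code from the thread; the checkpoint path and the `RANK` handling are illustrative assumptions:

```python
# Hypothetical diagnostic (not from the original thread): time the raw
# checkpoint read on each rank and compare 1-GPU vs. multi-GPU runs.
import os
import time

import torch

t0 = time.perf_counter()
# "checkpoints/model.pt" is an illustrative path, not the repo's actual layout.
state = torch.load("checkpoints/model.pt", map_location="cpu")
rank = os.environ.get("RANK", "0")  # set by torchrun and similar launchers
print(f"[rank {rank}] checkpoint load took {time.perf_counter() - t0:.1f}s")
```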
I changed inference.py > from_pretrained to load models sequentially (one GPU at a time), and that seemed to fix the problem: the log now shows the total load time being under a couple of minutes instead of up to 20 minutes per model. I've included the code update below and also created a pull request if you guys want to incorporate the fix into the main repo.

Here's the new from_pretrained code if you want to incorporate it:
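The original snippet wasn't preserved in this copy of the thread. As a rough illustration of the idea (loading on one GPU at a time rather than all at once), a sequential loader built on torch.distributed barriers might look like the sketch below; the `load_sequentially` helper, the model class, and the checkpoint path are assumptions, not the actual patch:

```python
import torch.distributed as dist

def load_sequentially(load_fn):
    """Run a heavy load one rank at a time so the processes never
    hit the checkpoint files on disk simultaneously. Sketch only;
    assumes torch.distributed has already been initialized."""
    if not (dist.is_available() and dist.is_initialized()):
        return load_fn()  # single-GPU / single-process fallback
    rank = dist.get_rank()
    result = None
    for turn in range(dist.get_world_size()):
        if rank == turn:
            result = load_fn()
        dist.barrier()  # everyone waits until this rank finishes loading
    return result

# Illustrative usage (model class and path are hypothetical):
# model = load_sequentially(lambda: MyModel.from_pretrained("checkpoints/model"))
```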
Thanks for your suggestion. In our test environment, multiple GPUs can load the model in parallel without encountering the problem mentioned in this issue; we suspect it may occur only on certain GPUs or in specific test environments. To stay compatible with those situations, we have merged this solution into the main repository and added it to our community contribution list.
Good to hear and glad to help!
@TianQi-777 Just wanted to make sure you were aware of my comment here. It should only be a good thing (it only kicks in after the 192-frame limit), but I'm just making sure you guys know it's there. It might also be good to credit thu-ml for the original discovery (although I still did some work to implement it for the repo here).