Training / Latent Extraction Bug #43
Thank you for your suggestion. We have fixed this problem based on your pull request.
One other bug should be noted: the latent extraction does not handle files that share a name but live in different subfolders. Only one file.npy ends up in the latents.
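To illustrate the name collision described above, here is a minimal sketch of one way to keep same-named files from different subfolders distinct. The helper `latent_name` and the `__` separator are hypothetical and are not part of the repository's code; the sketch assumes each video path sits under a known root folder.

```python
from pathlib import Path


def latent_name(video_path: str, root: str) -> str:
    """Build an .npy name from the video's path relative to root (hypothetical helper)."""
    # Drop the extension but keep the subfolder components.
    rel = Path(video_path).relative_to(root).with_suffix("")
    # Join the components so a/clip.mp4 and b/clip.mp4 map to different names.
    return "__".join(rel.parts) + ".npy"


if __name__ == "__main__":
    print(latent_name("data/a/clip.mp4", "data"))
    print(latent_name("data/b/clip.mp4", "data"))
```

Two files with the same relative path but different extensions would still collide under this scheme, so it is only a partial guard.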
I'm also getting multi-GPU issues with the training script. Testing on 4xH100 with "export HOST_GPU_NUM=4", it fails with a file-not-found error when trying to open json_path/.ckpts in hyvideo/dataset/video_loader.py.
Thank you for your suggestion. We assume the video_id of each training file (the file name in our case) is unique. We will note this in the Readme.
Hi, we could not find any code that saves checkpoints into data_jsons_path. That path is only used for saving data. Did you set it as your checkpoint save path?
No, I did not. This only happens with multi-GPU. The same code, configuration, and files run fine on a single GPU without this error, so something is appending .ckpts to the json_path scan when parallelized.
Which command did you run to get this error?
Inside the vae.yaml
Inside the .sh
Disregard - multi-gpu works fine after switching to this line in the sh file:
That should be made clearer in the readme, since it's a huge time saver for training. Last question: is there a way to resume an interrupted training session? I thought I saw it mentioned, but I can't find the exact arguments to use.
The dataset.py used in the latent extraction has a bug: in some cases it returns 3 or 4 values instead of 5. run.py then fails with a "not enough values to unpack" error.
In dataset.py, find:
Replace with:
Find:
Replace with:
I created a pull request with these changes here:
#44
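As a standalone illustration of the unpacking failure described above (not the actual change in the pull request), the sketch below shows how a caller that expects five values breaks on a shorter tuple, and one way to guard against it. The helper `pad_sample` and the choice of `None` as filler are assumptions made for illustration only.

```python
from typing import Any, Optional


def pad_sample(values: tuple, size: int = 5, fill: Optional[Any] = None) -> tuple:
    """Pad a tuple to `size` items so the caller can always unpack it (hypothetical helper)."""
    if len(values) > size:
        raise ValueError(f"expected at most {size} values, got {len(values)}")
    # Append filler entries for the missing positions.
    return tuple(values) + (fill,) * (size - len(values))


if __name__ == "__main__":
    short = ("pixels", "latent", "index")  # only 3 values, as in the buggy case
    try:
        a, b, c, d, e = short
    except ValueError as err:
        print("unpack failed:", err)
    a, b, c, d, e = pad_sample(short)
    print(a, b, c, d, e)
```

Returning the same number of values from every branch of the dataset code is the cleaner fix; padding at the call site only hides the inconsistency.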
I also included a bonus script "setVideosTo129Frames.py" for automatically stretching all videos in a folder to 129 frames to be usable in the training set.
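The core of such a stretch can be sketched as an index mapping from the source frames to 129 output frames. This is not the contents of setVideosTo129Frames.py; it is a minimal sketch that assumes nearest-frame sampling with evenly spaced positions, and the function name `resample_indices` is hypothetical.

```python
from typing import List


def resample_indices(num_frames: int, target: int = 129) -> List[int]:
    """Pick `target` source-frame indices spread evenly over a clip (hypothetical helper)."""
    if num_frames < 1 or target < 1:
        raise ValueError("num_frames and target must be at least 1")
    if target == 1:
        return [0]
    # Map output position i linearly onto [0, num_frames - 1] and round to a frame.
    step = (num_frames - 1) / (target - 1)
    return [round(i * step) for i in range(target)]


if __name__ == "__main__":
    idx = resample_indices(60)
    print(len(idx), idx[0], idx[-1])
```

Clips shorter than 129 frames get repeated frames and longer clips get dropped frames, so the playback speed changes accordingly.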
Lastly, I also lowered the default epochs from 100000 to 100. Who would ever finish 100000??? lol
Readme suggestion: once the checkpoints have been saved as safetensors, the checkpoints folder is actually not browsable with Jupyter or other file-browsing programs. To extract the files I had to do this: