Training / Latent Extraction Bug #43

Open
pftq opened this issue Mar 22, 2025 · 9 comments

@pftq
Contributor

pftq commented Mar 22, 2025

The dataset.py used in latent extraction has a bug where it returns 3 or 4 values in some cases instead of 5. This then causes run.py to throw a "not enough values to unpack" error.
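For context, run.py unpacks five values per dataset item along these lines (a minimal sketch with hypothetical variable names, not the exact code):

# run.py expects 5 values per item:
pixel_values, videoid, video_path, prompt, success = dataset[idx]
# If the dataset returns only 3 or 4, Python raises:
# ValueError: not enough values to unpack (expected 5, got 4)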

In dataset.py, find:

if len(batch_index) == 0:
    print("get video len=0, skip")
    return None, None, None, False

Replace with:

# 20250322 pftq: fixed to return 5 values for consistency and "not enough values to unpack" error
if len(batch_index) == 0:
    print(f"get video len=0, skip for {video_item['video_path']}")
    return None, video_item["videoid"], video_item["video_path"], video_item["prompt"], False

Find:

# Skip if exists
latent_save_path = Path(self.latent_cache_dir) / f"{video_item['videoid']}.npy"
if latent_save_path.exists():
    return None, None, False

Replace with:

# 20250322 pftq: fixed to return 5 values for consistency and "not enough values to unpack" error
# Skip if exists
latent_save_path = Path(self.latent_cache_dir) / f"{video_item['videoid']}.npy"
if latent_save_path.exists():
    return None, None, None, None, False

I created a pull request with these changes here:
#44

I also included a bonus script "setVideosTo129Frames.py" for automatically stretching all videos in a folder to 129 frames to be usable in the training set.
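The idea is roughly the following (a minimal sketch, assuming ffmpeg/ffprobe are on PATH and that "stretching" means retiming the clip to ~129 frames at its original fps; the "videos" folder is a hypothetical stand-in, and this is not the exact script from the PR):

import json
import subprocess
from pathlib import Path

TARGET_FRAMES = 129

def frame_count(path: Path) -> int:
    # Count decoded frames with ffprobe.
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-count_frames", "-select_streams", "v:0",
         "-show_entries", "stream=nb_read_frames", "-of", "json", str(path)],
        capture_output=True, text=True, check=True,
    )
    return int(json.loads(out.stdout)["streams"][0]["nb_read_frames"])

def stretch_to_target(path: Path) -> None:
    n = frame_count(path)
    if n in (0, TARGET_FRAMES):
        return
    factor = TARGET_FRAMES / n
    tmp = path.with_name(path.stem + ".tmp.mp4")
    # setpts rescales timestamps; at a constant frame rate this stretches
    # (or compresses) the clip to roughly TARGET_FRAMES frames.
    # Audio is dropped (-an) to avoid desync after retiming.
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(path), "-vf", f"setpts={factor:.6f}*PTS",
         "-an", str(tmp)],
        check=True,
    )
    tmp.replace(path)

for video in sorted(Path("videos").glob("*.mp4")):  # hypothetical folder
    stretch_to_target(video)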

Lastly, I also adjusted the default epochs from 100000 to 100. Who would ever finish 100000??? lol

group.add_argument("--epochs", type=int, default=100000, help="Number of epochs to train.")
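
i.e. the line becomes:

group.add_argument("--epochs", type=int, default=100, help="Number of epochs to train.")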

Readme suggestion: Once the checkpoints are saved as safetensors, the checkpoints folder is actually not browsable with Jupyter or other file-browsing programs. To extract the files I had to do this:

cd log_EXP
cd [name of lora]
cd checkpoints
mv global_step* ..
@TianQi-777
Collaborator

Thank you for your suggestion; we have fixed this problem according to your pull request.

@pftq
Contributor Author

pftq commented Mar 27, 2025

One other bug should be noted - the latent extraction does not handle files in different subfolders with the same name.
For example:
a/file.mp4
b/file.mp4

There will only be one file.npy in the latents folder.
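
One possible workaround (my own sketch, not code from the repo) would be to build the latent cache name from the relative path instead of the bare file name:

from pathlib import Path

def latent_name(video_path: str) -> str:
    # Hypothetical helper: include the parent folders in the cache name
    # so a/file.mp4 and b/file.mp4 no longer collide on file.npy.
    return "__".join(Path(video_path).with_suffix("").parts) + ".npy"

print(latent_name("a/file.mp4"))  # a__file.npy
print(latent_name("b/file.mp4"))  # b__file.npy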

@pftq
Contributor Author

pftq commented Mar 27, 2025

Getting multi-GPU issues with the training script as well. Testing with 4xH100 and "export HOST_GPU_NUM=4". It fails with a file-not-found error trying to open json_path/.ckpts in hyvideo/dataset/video_loader.py:

for json_file in json_files:
    with open(f"{data_jsons_path}/{json_file}", 'r', encoding='utf-8-sig') as file:
        data = json.load(file)

@Changlin-Lee
Collaborator

One other bug should be noted - the latent extraction does not handle files in different subfolders with the same name. For example: a/file.mp4 b/file.mp4

There will only be one file.npy in the latents folder.

Thank you for your suggestion. We assume each training file's video_id (the file name, in our case) is unique. We will note this in the Readme.

@Changlin-Lee
Collaborator

Getting multi-GPU issues with the training script as well. Testing with 4xH100 and "export HOST_GPU_NUM=4". It fails with a file-not-found error trying to open json_path/.ckpts in hyvideo/dataset/video_loader.py:

for json_file in json_files:
    with open(f"{data_jsons_path}/{json_file}", 'r', encoding='utf-8-sig') as file:
        data = json.load(file)

Hi, we can't find any code that would save a ckpt into data_jsons_path. That path is only used for saving data. Did you set it as your ckpt save path?

@pftq
Contributor Author

pftq commented Mar 27, 2025

No, I do not - this only happens with multi-GPU. The same code/configuration/files run fine on a single GPU without this error, so something is appending .ckpts to the json_path scan when parallelized.
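
In the meantime, a defensive filter on the scan would sidestep the crash (a sketch, assuming json_files comes from listing the metadata directory; data_jsons_path here is a hypothetical stand-in, not the repo's actual code):

import json
import os

data_jsons_path = "./training/data_jsons"  # hypothetical metadata dir

# Only read entries ending in .json so stray items like a ".ckpts"
# directory created next to the metadata are skipped during the scan.
json_files = [f for f in os.listdir(data_jsons_path) if f.endswith(".json")]
for json_file in json_files:
    with open(f"{data_jsons_path}/{json_file}", "r", encoding="utf-8-sig") as file:
        data = json.load(file)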

@Changlin-Lee
Collaborator

No, I do not - this only happens with multi-GPU. The same code/configuration/files run fine on a single GPU without this error, so something is appending .ckpts to the json_path scan when parallelized.

Which command did you run to get this error?

@pftq
Contributor Author

pftq commented Mar 27, 2025

export HOST_GPU_NUM=1
chmod +x ./hyvideo/hyvae_extract/start.sh
./hyvideo/hyvae_extract/start.sh
sh scripts/run_train_image2video_lora.sh

Inside the vae.yaml:

vae_path: "./ckpts/hunyuan-video-i2v-720p/vae"
video_url_files: "./training/meta_file.list"
output_base_dir: "./latents"
sample_n_frames: 129
target_size: 
  - 480
  - 480
enable_multi_aspect_ratio: True
use_stride: True

Inside the .sh:

...
params=" \
    --lr 1e-4 \
    --warmup-num-steps 200 \
    --global-seed 1024 \
    --tensorboard \
    --zero-stage 2 \
    --vae 884-16c-hy \
    --vae-precision fp16 \
    --vae-tiling \
    --denoise-type flow \
    --flow-reverse \
    --flow-shift 7.0 \
    --i2v-mode \
    --model HYVideo-T/2 \
    --video-micro-batch-size 1 \
    --gradient-checkpoint \
    --ckpt-every 300 \
    --embedded-cfg-scale 6.0 \
    --epochs 51 \
    --final-save \
    "
...

@pftq
Contributor Author

pftq commented Mar 30, 2025

Disregard - multi-GPU works fine after switching to this line in the .sh file:

# single node, multi gpu
deepspeed --include localhost:0,1,2,3,4,5,6,7 --master_addr "${CHIEF_IP}" \

That should be made clearer in the readme - it's a huge time saver for training.

Last question - is there a way to continue an interrupted training session? I thought I saw mention of it but I can't find the exact arguments to use.
