Training / Latent Extraction Bug #43

Open
pftq opened this issue Mar 22, 2025 · 9 comments

@pftq
Contributor

pftq commented Mar 22, 2025

The dataset.py used in latent extraction has a bug where it returns 3 or 4 values in some cases instead of 5. This then causes run.py to throw a "not enough values to unpack" error.
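For context, run.py unpacks five values per dataset item along these lines (a minimal sketch with hypothetical variable names, not the exact code):

# run.py expects 5 values per item:
pixel_values, videoid, video_path, prompt, success = dataset[idx]
# If the dataset returns only 3 or 4, Python raises:
# ValueError: not enough values to unpack (expected 5, got 4)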

In dataset.py, find:

if len(batch_index) == 0:
    print("get video len=0, skip")
    return None, None, None, False

Replace with:

# 20250322 pftq: fixed to return 5 values for consistency and "not enough values to unpack" error
if len(batch_index) == 0:
    print(f"get video len=0, skip for {video_item['video_path']}")
    return None, video_item["videoid"], video_item["video_path"], video_item["prompt"], False

Find:

# Skip if exists
latent_save_path = Path(self.latent_cache_dir) / f"{video_item['videoid']}.npy"
if latent_save_path.exists():
    return None, None, False

Replace with:

# 20250322 pftq: fixed to return 5 values for consistency and "not enough values to unpack" error
# Skip if exists
latent_save_path = Path(self.latent_cache_dir) / f"{video_item['videoid']}.npy"
if latent_save_path.exists():
    return None, None, None, None, False

I created a pull request with these changes here:
#44

I also included a bonus script "setVideosTo129Frames.py" for automatically stretching all videos in a folder to 129 frames to be usable in the training set.
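The idea is roughly the following (a minimal sketch, assuming ffmpeg/ffprobe are on PATH and that "stretching" means retiming the clip to ~129 frames at its original fps; the "videos" folder is a hypothetical stand-in, and this is not the exact script from the PR):

import json
import subprocess
from pathlib import Path

TARGET_FRAMES = 129

def frame_count(path: Path) -> int:
    # Count decoded frames with ffprobe.
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-count_frames", "-select_streams", "v:0",
         "-show_entries", "stream=nb_read_frames", "-of", "json", str(path)],
        capture_output=True, text=True, check=True,
    )
    return int(json.loads(out.stdout)["streams"][0]["nb_read_frames"])

def stretch_to_target(path: Path) -> None:
    n = frame_count(path)
    if n in (0, TARGET_FRAMES):
        return
    factor = TARGET_FRAMES / n
    tmp = path.with_name(path.stem + ".tmp.mp4")
    # setpts rescales timestamps; at a constant frame rate this stretches
    # (or compresses) the clip to roughly TARGET_FRAMES frames.
    # Audio is dropped (-an) to avoid desync after retiming.
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(path), "-vf", f"setpts={factor:.6f}*PTS",
         "-an", str(tmp)],
        check=True,
    )
    tmp.replace(path)

for video in sorted(Path("videos").glob("*.mp4")):  # hypothetical folder
    stretch_to_target(video)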

Lastly, I also adjusted the default epochs from 100000 to 100. Who would ever finish 100000??? lol

group.add_argument("--epochs", type=int, default=100000, help="Number of epochs to train.")
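
i.e. the line becomes:

group.add_argument("--epochs", type=int, default=100, help="Number of epochs to train.")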

Readme suggestion: Once the checkpoints are saved as safetensors, the checkpoints folder is actually not browsable with Jupyter or other file-browsing programs. To extract the files I had to do this:

cd log_EXP
cd [name of lora]
cd checkpoints
mv global_step* ..
@TianQi-777
Collaborator

Thank you for your suggestion; we have fixed this problem according to your pull request.

@pftq
Contributor Author

pftq commented Mar 27, 2025

One other bug should be noted - the latent extraction does not handle files in different subfolders with the same name.
For example:
a/file.mp4
b/file.mp4

There will only be one file.npy in the latents folder.
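
One possible workaround (my own sketch, not code from the repo) would be to build the latent cache name from the relative path instead of the bare file name:

from pathlib import Path

def latent_name(video_path: str) -> str:
    # Hypothetical helper: include the parent folders in the cache name
    # so a/file.mp4 and b/file.mp4 no longer collide on file.npy.
    return "__".join(Path(video_path).with_suffix("").parts) + ".npy"

print(latent_name("a/file.mp4"))  # a__file.npy
print(latent_name("b/file.mp4"))  # b__file.npy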

@pftq
Contributor Author

pftq commented Mar 27, 2025

Getting multi-GPU issues with the training script as well. Testing with 4xH100 and "export HOST_GPU_NUM=4". It fails with a file-not-found error trying to open json_path/.ckpts in hyvideo/dataset/video_loader.py:

for json_file in json_files:
    with open(f"{data_jsons_path}/{json_file}", 'r', encoding='utf-8-sig') as file:
        data = json.load(file)

@Changlin-Lee
Collaborator

One other bug should be noted - the latent extraction does not handle files in different subfolders with the same name. For example: a/file.mp4 b/file.mp4

There will only be one file.npy in the latents folder.

Thank you for your suggestion. We assume each training file's video_id (the file name, in our case) is unique. We will note this in the Readme.

@Changlin-Lee
Collaborator

Getting multi-GPU issues with the training script as well. Testing with 4xH100 and "export HOST_GPU_NUM=4". It fails with a file-not-found error trying to open json_path/.ckpts in hyvideo/dataset/video_loader.py:

for json_file in json_files:
    with open(f"{data_jsons_path}/{json_file}", 'r', encoding='utf-8-sig') as file:
        data = json.load(file)

Hi, we can't find any code that would save a ckpt into data_jsons_path. That path is only used for saving data. Did you set it as your ckpt save path?

@pftq
Contributor Author

pftq commented Mar 27, 2025

No, I do not - this only happens with multi-GPU. The same code/configuration/files run fine on a single GPU without this error, so something is appending .ckpts to the json_path scan when parallelized.
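
In the meantime, a defensive filter on the scan would sidestep the crash (a sketch, assuming json_files comes from listing the metadata directory; data_jsons_path here is a hypothetical stand-in, not the repo's actual code):

import json
import os

data_jsons_path = "./training/data_jsons"  # hypothetical metadata dir

# Only read entries ending in .json so stray items like a ".ckpts"
# directory created next to the metadata are skipped during the scan.
json_files = [f for f in os.listdir(data_jsons_path) if f.endswith(".json")]
for json_file in json_files:
    with open(f"{data_jsons_path}/{json_file}", "r", encoding="utf-8-sig") as file:
        data = json.load(file)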

@Changlin-Lee
Collaborator

No, I do not - this only happens with multi-GPU. The same code/configuration/files run fine on a single GPU without this error, so something is appending .ckpts to the json_path scan when parallelized.

Which command did you run to get this error?

@pftq
Contributor Author

pftq commented Mar 27, 2025

export HOST_GPU_NUM=1
chmod +x ./hyvideo/hyvae_extract/start.sh
./hyvideo/hyvae_extract/start.sh
sh scripts/run_train_image2video_lora.sh

Inside the vae.yaml:

vae_path: "./ckpts/hunyuan-video-i2v-720p/vae"
video_url_files: "./training/meta_file.list"
output_base_dir: "./latents"
sample_n_frames: 129
target_size: 
  - 480
  - 480
enable_multi_aspect_ratio: True
use_stride: True

Inside the .sh:

...
params=" \
    --lr 1e-4 \
    --warmup-num-steps 200 \
    --global-seed 1024 \
    --tensorboard \
    --zero-stage 2 \
    --vae 884-16c-hy \
    --vae-precision fp16 \
    --vae-tiling \
    --denoise-type flow \
    --flow-reverse \
    --flow-shift 7.0 \
    --i2v-mode \
    --model HYVideo-T/2 \
    --video-micro-batch-size 1 \
    --gradient-checkpoint \
    --ckpt-every 300 \
    --embedded-cfg-scale 6.0 \
    --epochs 51 \
    --final-save \
    "
...

@pftq
Contributor Author

pftq commented Mar 30, 2025

Disregard - multi-GPU works fine after switching to this line in the .sh file:

# single node, multi gpu
deepspeed --include localhost:0,1,2,3,4,5,6,7 --master_addr "${CHIEF_IP}" \

That should be made clearer in the readme - it's a huge time saver for training.

Last question - is there a way to continue an interrupted training session? I thought I saw mention of it but I can't find the exact arguments to use.
