
Commit 515bc9d: fix id change bug
1 parent f1aa9a4

Note: this is a large commit, so some of the changed files are hidden by default; entries below without a file name correspond to hidden content.

49 files changed: +290 -1754 lines

README.md (+13 -19)
@@ -42,6 +42,7 @@ This repo contains offical PyTorch model definitions, pre-trained weights and in
 
 
 ## 🔥🔥🔥 News!!
+* Mar 07, 2025: 🔥 We have fixed the bug in our open-source version that caused ID changes. Please try the new model weights of [HunyuanVideo-I2V](https://huggingface.co/tencent/HunyuanVideo-I2V) to ensure full visual consistency in the first frame and produce higher quality videos.
 * Mar 06, 2025: 👋 We release the inference code and model weights of HunyuanVideo-I2V. [Download](https://github.com/Tencent/HunyuanVideo-I2V/blob/main/ckpts/README.md).
 
 
@@ -53,7 +54,12 @@ This repo contains offical PyTorch model definitions, pre-trained weights and in
 <p>Co-creator @D-aiY Director Ding Yi</p>
 </div>
 
-### Customizable I2V LoRA Demo
+### Frist Frame Consistency Demo
+| Reference Image | Generated Video |Reference Image | Generated Video |Reference Image | Generated Video |
+|:----------------:|:----------------:|:----------------:|:----------------:|:----------------:|:----------------:|
+| <img src="https://github.com/user-attachments/assets/83e7a097-ffca-40db-9c72-be01d866aa7d" width="80%"> | <video src="https://github.com/user-attachments/assets/f81d2c88-bb1a-43f8-b40f-1ccc20774563" width="100%"> </video> | <img src="https://github.com/user-attachments/assets/c385a11f-60c7-4919-b0f1-bc5e715f673c" width="50%"> | <video src="https://github.com/user-attachments/assets/0c29ede9-0481-4d40-9c67-a4b6267fdc2d" width="100%"> </video> | <img src="https://github.com/user-attachments/assets/5763f5eb-0be5-4b36-866a-5199e31c5802" width="95%"> | <video src="https://github.com/user-attachments/assets/a8da0a1b-ba7d-45a4-a901-5d213ceaf50e" width="100%"> </video> |
+
+<!-- ### Customizable I2V LoRA Demo
 
 | I2V Lora Effect | Reference Image | Generated Video |
 |:---------------:|:--------------------------------:|:----------------:|
@@ -74,16 +80,16 @@ This repo contains offical PyTorch model definitions, pre-trained weights and in
 - Enhance-A-Video (Better Generated Video for Free): [Enhance-A-Video](https://github.com/NUS-HPC-AI-Lab/Enhance-A-Video) by [NUS-HPC-AI-Lab](https://ai.comp.nus.edu.sg/)
 - TeaCache (Cache-based Accelerate): [TeaCache](https://github.com/LiewFeng/TeaCache) by [Feng Liu](https://github.com/LiewFeng)
 - HunyuanVideoGP (GPU Poor version): [HunyuanVideoGP](https://github.com/deepbeepmeep/HunyuanVideoGP) by [DeepBeepMeep](https://github.com/deepbeepmeep)
--->
+-->
 
 
 
 ## 📑 Open-source Plan
 - HunyuanVideo-I2V (Image-to-Video Model)
-  - [x] Lora training scripts
   - [x] Inference
   - [x] Checkpoints
   - [x] ComfyUI
+  - [ ] Lora training scripts
   - [ ] Multi-gpus Sequence Parallel inference (Faster inference speed on more gpus)
   - [ ] Diffusers
   - [ ] FP8 Quantified weight
@@ -93,7 +99,7 @@ This repo contains offical PyTorch model definitions, pre-trained weights and in
 - [🔥🔥🔥 News!!](#-news)
 - [🎥 Demo](#-demo)
   - [I2V Demo](#i2v-demo)
-  - [Customizable I2V LoRA Demo](#customizable-i2v-lora-demo)
+  - [Frist Frame Consistency Demo](#frist-frame-consistency-demo)
 - [📑 Open-source Plan](#-open-source-plan)
 - [Contents](#contents)
 - [**HunyuanVideo-I2V Overall Architecture**](#hunyuanvideo-i2v-overall-architecture)
@@ -105,18 +111,12 @@ This repo contains offical PyTorch model definitions, pre-trained weights and in
   - [Tips for Using Image-to-Video Models](#tips-for-using-image-to-video-models)
   - [Using Command Line](#using-command-line)
   - [More Configurations](#more-configurations)
-- [🎉 Customizable I2V LoRA effects training](#-customizable-i2v-lora-effects-training)
-  - [Requirements](#requirements)
-  - [Environment](#environment)
-  - [Training data construction](#training-data-construction)
-  - [Training](#training)
-  - [Inference](#inference)
 - [🔗 BibTeX](#-bibtex)
 - [Acknowledgements](#acknowledgements)
 ---
 
 ## **HunyuanVideo-I2V Overall Architecture**
-Leveraging the advanced video generation capabilities of [HunyuanVideo](https://github.com/Tencent/HunyuanVideo), we have extended its application to image-to-video generation tasks. To achieve this, we employ an image latent concatenation technique to effectively reconstruct and incorporate reference image information into the video generation process.
+Leveraging the advanced video generation capabilities of [HunyuanVideo](https://github.com/Tencent/HunyuanVideo), we have extended its application to image-to-video generation tasks. To achieve this, we employ a token replace technique to effectively reconstruct and incorporate reference image information into the video generation process.
 
 Since we utilizes a pre-trained Multimodal Large Language Model (MLLM) with a Decoder-Only architecture as the text encoder, we can significantly enhance the model's ability to comprehend the semantic content of the input image and to seamlessly integrate information from both the image and its associated caption. Specifically, the input image is processed by the MLLM to generate semantic image tokens. These tokens are then concatenated with the video latent tokens, enabling comprehensive full-attention computation across the combined data.
 
@@ -212,12 +212,6 @@ Similar to [HunyuanVideo](https://github.com/Tencent/HunyuanVideo), HunyuanVideo
 - **Camera Angle (Optional)**: Indicate the perspective or viewpoint.
 - **Avoid Overly Detailed Prompts**: Lengthy or highly detailed prompts can lead to unnecessary transitions in the video output.
 
-For example:
-1. A man with short gray hair plays a red electric guitar.
-2. A woman sits on a wooden floor, holding a colorful bag.
-3. A bee flaps its wings. The camera movement is Zoom Out/Zoom In/Pan Right.
-4. A little boy closes his mouth, stands up, and lifts his left hand. The background is blurred.
-
 <!-- **For image-to-video models, we recommend using concise prompts to guide the model's generation process. A good prompt should include elements such as background, main subject, action, and camera angle. Overly long or excessively detailed prompts may introduce unnecessary transitions.** -->
 
 ### Using Command Line
@@ -266,7 +260,7 @@ We list some more useful configurations for easy usage:
 | `--save-path` | ./results | Path to save the generated video. |
 
 
-## 🎉 Customizable I2V LoRA effects training
+<!-- ## 🎉 Customizable I2V LoRA effects training
 
 ### Requirements
 
@@ -336,7 +330,7 @@ We list some lora specific configurations for easy usage:
 |:-------------------:|:-------:|:----------------------------:|
 | `--use-lora` | False | Whether to open lora mode. |
 | `--lora-scale` | 1.0 | Fusion scale for lora model. |
-| `--lora-path` | "" | Weight path for lora model. |
+| `--lora-path` | "" | Weight path for lora model. | -->
 
 
 ## 🔗 BibTeX
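The hunk above changes the stated conditioning method from an image latent concatenation technique to a token replace technique, which is the substance of the ID-change fix. Below is a minimal sketch of how the two schemes differ, in line with the pipeline changes later in this commit; the helper name `build_i2v_latents`, its signature, and the `mask_concat` handling are illustrative assumptions rather than code from the repository.

```python
import torch

def build_i2v_latents(latents, img_latents, condition_type, mask_concat=None):
    """Contrast the two I2V conditioning schemes named in this commit (sketch only).

    Shapes follow the pipeline diff further down: latents is (B, C, T, H, W) video
    latents, img_latents is (B, C, 1, H, W) latents of the reference image.
    """
    if condition_type == "latent_concat":
        # Previous scheme: broadcast the image latent across time and stack it
        # (plus a mask) along the channel dimension as extra conditioning channels.
        img_rep = img_latents.repeat(1, 1, latents.shape[2], 1, 1)
        return torch.concat([latents, img_rep, mask_concat], dim=1)
    if condition_type == "token_replace":
        # New scheme: overwrite the first temporal latent frame with the reference
        # image latent, so the first frame is reproduced exactly.
        return torch.concat([img_latents, latents[:, :, 1:, :, :]], dim=2)
    raise ValueError(f"unknown i2v condition type: {condition_type}")
```

Because `token_replace` re-imposes the clean reference latent on the first frame at every denoising step, the generated video cannot drift away from the identity in the input image.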

README_zh.md (+13 -13)
@@ -34,6 +34,7 @@
 > [**HunyuanVideo: A Systematic Framework For Large Video Generation Model**](https://arxiv.org/abs/2412.03603)
 
 ## 🔥🔥🔥 Latest News
+* March 7, 2025: 🔥 We have fixed the bug in the open-source version that caused ID changes. Please try the new model weights of [HunyuanVideo-I2V](https://huggingface.co/tencent/HunyuanVideo-I2V) to ensure full visual consistency of the first frame and to produce higher-quality videos.
 * March 6, 2025: 👋 Released the inference code and model weights of HunyuanVideo-I2V. [Download](https://github.com/Tencent/HunyuanVideo-I2V/blob/main/ckpts/README.md)
 
 ## 🎥 Demo
@@ -43,19 +44,24 @@
 <p>Co-creator @D-aiY, Director Ding Yi</p>
 </div>
 
-### Customizable I2V LoRA Effects Demo
+### First Frame Consistency Demo
+| Reference Image | Generated Video |Reference Image | Generated Video |Reference Image | Generated Video |
+|:----------------:|:----------------:|:----------------:|:----------------:|:----------------:|:----------------:|
+| <img src="https://github.com/user-attachments/assets/83e7a097-ffca-40db-9c72-be01d866aa7d" width="80%"> | <video src="https://github.com/user-attachments/assets/f81d2c88-bb1a-43f8-b40f-1ccc20774563" width="100%"> </video> | <img src="https://github.com/user-attachments/assets/c385a11f-60c7-4919-b0f1-bc5e715f673c" width="50%"> | <video src="https://github.com/user-attachments/assets/0c29ede9-0481-4d40-9c67-a4b6267fdc2d" width="100%"> </video> | <img src="https://github.com/user-attachments/assets/5763f5eb-0be5-4b36-866a-5199e31c5802" width="95%"> | <video src="https://github.com/user-attachments/assets/a8da0a1b-ba7d-45a4-a901-5d213ceaf50e" width="100%"> </video> |
+
+<!-- ### Customizable I2V LoRA Effects Demo
 
 | Effect Type | Reference Image | Generated Video |
 |:---------------:|:--------------------------------:|:----------------:|
 | Hair growth | <img src="./assets/demo/i2v_lora/imgs/hair_growth.png" width="40%"> | <video src="https://github.com/user-attachments/assets/06b998ae-bbde-4c1f-96cb-a25a9197d5cb" width="100%"> </video> |
-| Embrace | <img src="./assets/demo/i2v_lora/imgs/embrace.png" width="40%"> | <video src="https://github.com/user-attachments/assets/f8c99eb1-2a43-489a-ba02-6bd50a6dd260" width="100%" > </video> |
+| Embrace | <img src="./assets/demo/i2v_lora/imgs/embrace.png" width="40%"> | <video src="https://github.com/user-attachments/assets/f8c99eb1-2a43-489a-ba02-6bd50a6dd260" width="100%" > </video> | -->
 
 ## 📑 Open-source Plan
 - HunyuanVideo-I2V (Image-to-Video Model)
-  - [x] LoRA training scripts
   - [x] Inference code
   - [x] Model weights
   - [x] ComfyUI support
+  - [ ] LoRA training scripts
   - [ ] Multi-GPU sequence parallel inference (faster inference on more GPUs)
   - [ ] Diffusers integration
   - [ ] FP8 quantized weights
@@ -65,7 +71,7 @@
 - [🔥🔥🔥 Latest News](#-最新动态)
 - [🎥 Demo](#-演示)
   - [I2V Demo](#i2v-示例)
-  - [Customizable I2V LoRA Effects Demo](#定制化i2v-lora效果演示)
+  - [First Frame Consistency Demo](#首帧一致性示例)
 - [📑 Open-source Plan](#-开源计划)
 - [Contents](#目录)
 - [**HunyuanVideo-I2V Overall Architecture**](#hunyuanvideo-i2v-整体架构)
@@ -77,19 +83,13 @@
   - [Tips for Using Image-to-Video Models](#使用图生视频模型的建议)
   - [Using Command Line](#使用命令行)
   - [More Configurations](#更多配置)
-- [🎉 Customizable I2V LoRA Effects Training](#自定义-i2v-lora-效果训练)
-  - [Requirements](#要求)
-  - [Training Environment](#训练环境)
-  - [Training Data Construction](#训练数据构建)
-  - [Training](#开始训练)
-  - [Inference](#推理)
 - [🔗 BibTeX](#-bibtex)
 - [Acknowledgements](#致谢)
 
 ---
 
 ## **HunyuanVideo-I2V Overall Architecture**
-Building on the powerful video generation capability of [HunyuanVideo](https://github.com/Tencent/HunyuanVideo), we extend it to the image-to-video generation task. To this end, we adopt an image latent concatenation technique to effectively reconstruct and incorporate the reference image information into the video generation process.
+Building on the powerful video generation capability of [HunyuanVideo](https://github.com/Tencent/HunyuanVideo), we extend it to the image-to-video generation task. To this end, we adopt a first-frame token replacement scheme to effectively reconstruct and incorporate the reference image information into the video generation process.
 
 Since we use a pre-trained Multimodal Large Language Model (MLLM) with a Decoder-Only architecture as the text encoder, we can significantly enhance the model's ability to understand the semantic content of the input image and deeply fuse the image information with its textual description. Specifically, the input image is processed by the MLLM to produce semantic image tokens, which are concatenated with the video latent tokens to enable full-attention computation across modalities.
 
@@ -224,7 +224,7 @@ python3 sample_image2video.py \
 | `--save-path` | ./results | Path to save the generated video. |
 
 
-## 🎉 Customizable I2V LoRA Effects Training
+<!-- ## 🎉 Customizable I2V LoRA Effects Training
 
 ### Requirements
 
@@ -292,7 +292,7 @@ python3 sample_image2video.py \
 |:-------------------:|:-------:|:----------------------------:|
 | `--use-lora` | None | Whether to enable LoRA mode. |
 | `--lora-scale` | 1.0 | Fusion scale for the LoRA model. |
-| `--lora-path` | "" | Weight path for the LoRA model. |
+| `--lora-path` | "" | Weight path for the LoRA model. | -->
 
 ## 🔗 BibTeX
 
Binary and other asset changes:

assets/backbone.png: -69 KB
assets/demo/i2v/imgs/0.jpg: 401 KB
assets/demo/i2v/imgs/0.png: -309 KB
assets/demo/i2v/imgs/1.png: 8.01 MB
assets/demo/i2v/imgs/2.png: -541 KB
assets/demo/i2v/imgs/3.png: -663 KB
assets/demo/i2v/imgs/4.png: -1.84 MB
assets/demo/i2v/videos/0.mp4: 761 KB
assets/demo/i2v/videos/1.mp4: 343 KB
assets/demo/i2v/videos/2.mp4: -566 KB
assets/demo/i2v/videos/3.mp4: -1.64 MB
assets/demo/i2v/videos/4.mp4: -1010 KB
assets/demo/i2v_lora/imgs/embrace.png: -4.68 MB
(file name hidden): -2.22 MB
assets/demo/i2v_lora/train_dataset/meta_data.json: -6 lines (file deleted)
assets/demo/i2v_lora/train_dataset/meta_file.list: -1 line (file deleted)
(file name hidden): binary file
assets/demo/i2v_lora/train_dataset/processed_data/json_path/embrace.json: -1 line (file deleted)
(file name hidden): -1.06 MB
(file name hidden): -412 KB
assets/overall.png: -1.22 MB
demo/0.jpg: 401 KB
demo/1.jpg: 175 KB
demo/2.png: 2.32 MB
demo/3.png: 1.4 MB
demo/4.png: 3.13 MB
demo/5.jpg: 107 KB

hyvideo/config.py (+24 -3)
@@ -504,15 +504,36 @@ def add_i2v_args(parser: argparse.ArgumentParser):
     group = parser.add_argument_group(title="I2V args")
 
     group.add_argument(
-        "--i2v-mode", action="store_true", help="Whether to open i2v mode."
+        "--i2v-mode",
+        action="store_true",
+        help="Whether to open i2v mode."
+    )
+
+    group.add_argument(
+        "--i2v-resolution",
+        type=str,
+        default="720p",
+        choices=["720p", "540p", "360p"],
+        help="Resolution for i2v inference."
     )
 
     group.add_argument(
-        "--i2v-resolution", type=str, default="720p", choices=["720p", "540p", "360p"], help="Resolution for i2v inference."
+        "--i2v-image-path",
+        type=str,
+        default="./assets/demo/i2v/imgs/0.png",
+        help="Image path for i2v inference."
+    )
+
+    group.add_argument(
+        "--i2v-condition-type",
+        type=str,
+        default="token_replace",
+        choices=["token_replace", "latent_concat"],
+        help="Condition type for i2v model."
     )
 
     group.add_argument(
-        "--i2v-image-path", type=str, default="./assets/demo/i2v/imgs/0.png", help="Image path for i2v inference."
+        "--i2v-stability", action="store_true", help="Whether to use i2v stability mode."
     )
 
     return parser
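For a quick sanity check of what the rewritten argument group exposes, here is a small usage sketch. It assumes the repository root is on `PYTHONPATH`; the flag values are only illustrative, and real generation still goes through `sample_image2video.py` with the full argument set.

```python
# Rough usage sketch of the reworked I2V flags (assumes the repo is importable;
# the example argument values are illustrative, not recommendations).
import argparse
from hyvideo.config import add_i2v_args  # defined in the file patched above

parser = add_i2v_args(argparse.ArgumentParser())
args = parser.parse_args([
    "--i2v-mode",
    "--i2v-image-path", "./assets/demo/i2v/imgs/0.jpg",
    "--i2v-resolution", "720p",
    "--i2v-stability",  # new flag: opt-in stability mode
])
# --i2v-condition-type defaults to "token_replace", the scheme added by this commit.
print(args.i2v_condition_type, args.i2v_stability)
```

Note that `--i2v-condition-type` defaults to `token_replace`, so existing commands pick up the new conditioning scheme without extra flags, while `--i2v-stability` remains opt-in.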

hyvideo/diffusion/pipelines/pipeline_hunyuan_video.py (+25 -8)
@@ -576,8 +576,10 @@ def prepare_latents(
         latents=None,
         img_latents=None,
         i2v_mode=False,
+        i2v_condition_type=None,
+        i2v_stability=True,
     ):
-        if i2v_mode:
+        if i2v_mode and i2v_condition_type == "latent_concat":
             num_channels_latents = (num_channels_latents - 1) // 2
         shape = (
             batch_size,
@@ -592,7 +594,7 @@ def prepare_latents(
                     f" size of {batch_size}. Make sure the batch size matches the length of the generators."
                 )
 
-        if i2v_mode:
+        if i2v_mode and i2v_stability:
             if img_latents.shape[2] == 1:
                 img_latents = img_latents.repeat(1, 1, video_length, 1, 1)
             x0 = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
@@ -722,6 +724,8 @@ def __call__(
         n_tokens: Optional[int] = None,
         embedded_guidance_scale: Optional[float] = None,
         i2v_mode: bool = False,
+        i2v_condition_type: str = None,
+        i2v_stability: bool = True,
         img_latents: Optional[torch.Tensor] = None,
         semantic_images=None,
         **kwargs,
@@ -963,9 +967,11 @@ def __call__(
             latents,
             img_latents=img_latents,
             i2v_mode=i2v_mode,
+            i2v_condition_type=i2v_condition_type,
+            i2v_stability=i2v_stability
         )
 
-        if i2v_mode:
+        if i2v_mode and i2v_condition_type == "latent_concat":
             if img_latents.shape[2] == 1:
                 img_latents_concat = img_latents.repeat(1, 1, video_length, 1, 1)
             else:
@@ -1004,8 +1010,11 @@ def __call__(
                 if self.interrupt:
                     continue
 
+                if i2v_mode and i2v_condition_type == "token_replace":
+                    latents = torch.concat([img_latents, latents[:, :, 1:, :, :]], dim=2)
+
                 # expand the latents if we are doing classifier free guidance
-                if i2v_mode:
+                if i2v_mode and i2v_condition_type == "latent_concat":
                     latent_model_input = torch.concat([latents, img_latents_concat, mask_concat], dim=1)
                 else:
                     latent_model_input = latents
@@ -1066,9 +1075,17 @@ def __call__(
                     )
 
                 # compute the previous noisy sample x_t -> x_t-1
-                latents = self.scheduler.step(
-                    noise_pred, t, latents, **extra_step_kwargs, return_dict=False
-                )[0]
+                if i2v_mode and i2v_condition_type == "token_replace":
+                    latents = self.scheduler.step(
+                        noise_pred[:, :, 1:, :, :], t, latents[:, :, 1:, :, :], **extra_step_kwargs, return_dict=False
+                    )[0]
+                    latents = torch.concat(
+                        [img_latents, latents], dim=2
+                    )
+                else:
+                    latents = self.scheduler.step(
+                        noise_pred, t, latents, **extra_step_kwargs, return_dict=False
+                    )[0]
 
                 if callback_on_step_end is not None:
                     callback_kwargs = {}
@@ -1139,7 +1156,7 @@ def __call__(
         # we always cast to float32 as this does not cause significant overhead and is compatible with bfloa16
         image = image.cpu().float()
 
-        if i2v_mode:
+        if i2v_mode and i2v_condition_type == "latent_concat":
             image = image[:, :, 4:, :, :]
 
         # Offload all models
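Putting the loop changes together: with `token_replace`, the reference image latent overwrites the first temporal frame before each transformer forward pass, and the scheduler then steps only the remaining frames. Below is a condensed sketch of that per-step update, assuming the shapes used in the diff; the function name and packaging are illustrative, since in the pipeline this logic is inline in `__call__`.

```python
import torch

def token_replace_update(scheduler, noise_pred, latents, img_latents, t, **extra_step_kwargs):
    """One denoising update under token_replace conditioning (sketch of the hunks above,
    not the full pipeline). Assumes latents is (B, C, T, H, W) with frame 0 already set
    to img_latents before the transformer forward pass, and img_latents is (B, C, 1, H, W).
    """
    # Step the scheduler only on frames 1..T-1; the reference frame is never denoised.
    denoised_rest = scheduler.step(
        noise_pred[:, :, 1:, :, :], t, latents[:, :, 1:, :, :],
        **extra_step_kwargs, return_dict=False,
    )[0]
    # Re-attach the clean first-frame latents so the output stays pinned to the
    # reference image across all steps (the "ID change" fix this commit describes).
    return torch.concat([img_latents, denoised_rest], dim=2)
```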

hyvideo/ds_config.py (-63 lines, file deleted)
