[LTX Video] Full finetuning #272
Thanks for the detailed writeup!
I'd like to quote one of my favourite YouTube channels, Two Minute Papers, and say that we shouldn't be looking at what's available now, but instead imagine what will be possible two papers down the line. Sure, it's harder to run at the moment, but people will figure out ways to cleverly offload, reduce the computation required, distill, and apply loads of clever tricks to make local inference faster. There is high demand for such models, so it's only a matter of time before someone releases consumer-friendly models competitive with the closed-source state of the art. The same happened in the LLM field, and now we have people running better models than the OG ChatGPT world-killer-robots-going-to-take-over-my-job-oh-god-im-losing-my-mind-wtf model on a local 2x Mac mini setup with decent decoding speed.
This is actually a hard question to answer without properly studying the scaling laws for smaller video diffusion models like LTXVideo. To be quite honest, I can't provide an answer that would definitely be correct. Instead, I would recommend doing some hyperparameter ablations of your own and going through this paper for some good starting points. One thing I've convincingly found to work better is the LTX first-frame image conditioning (the current finetrainers v0.0.1 release does not support it, but #245 does), so if I were you, I'd definitely dig into that a bit.
From my experiments, I have not found any evidence that multi-resolution training helps if you want to generate videos at a fixed resolution/frame count. So, if you're looking to finetune for a specific use case, I'd suggest training on same-resolution/same-length videos in bulk. A decent starting point would be ~100-200 videos: see how much you can overfit without the model completely collapsing. Once you've gotten it to overfit on about 30-50% of them within a low number of training steps (it doesn't have to reproduce the exact videos, just be able to replicate parts of them), you can scale your dataset to 1000+ videos with the same hyperparameters.
I haven't done any large-scale training yet (> 1M images/videos), so I can't really answer what the optimal hyperparameter choices are for a given dataset size, if that's what you're going for -- I will eventually be able to, after making our training scalable and maximizing our FLOP utilization as much as I know how to. There isn't really an incentive for us to do training at scale yet (maybe other than learning how to?) without improving the infra first (since others are going to pick the low-hanging fruit anyway), but I'll try to write some guides about the learnings along the way.
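To make that overfitting sanity check concrete, here is a minimal sketch (plain Python plus ffprobe, not finetrainers code; the folder layout, the sidecar .txt caption files, and the 150-video target are assumptions to adapt) for pulling a fixed-resolution subset out of a larger corpus:

```python
import random
import shutil
import subprocess
from pathlib import Path

SOURCE_DIR = Path("corpus/all_clips")        # assumed layout: one .mp4 per clip
SUBSET_DIR = Path("corpus/overfit_subset")   # subset used for the overfit test
TARGET_WH = (512, 896)                       # fixed width x height, divisible by 32
SUBSET_SIZE = 150                            # ~100-200 videos, as suggested above

def video_size(path: Path) -> tuple[int, int]:
    """Read width/height of the first video stream with ffprobe."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "v:0",
         "-show_entries", "stream=width,height", "-of", "csv=p=0", str(path)],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    w, h = (int(x) for x in out.split(","))
    return w, h

# Keep only clips that already match the target training resolution.
candidates = [p for p in SOURCE_DIR.glob("*.mp4") if video_size(p) == TARGET_WH]
subset = random.sample(candidates, min(SUBSET_SIZE, len(candidates)))

SUBSET_DIR.mkdir(parents=True, exist_ok=True)
for clip in subset:
    shutil.copy2(clip, SUBSET_DIR / clip.name)
    caption = clip.with_suffix(".txt")       # assumed: caption stored next to the clip
    if caption.exists():
        shutil.copy2(caption, SUBSET_DIR / caption.name)

print(f"{len(subset)} clips at {TARGET_WH} copied to {SUBSET_DIR}")
```

Train on that subset until parts of the clips are clearly reproducible, then reuse the same hyperparameters on the 1000+ video set.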
I agree, of course. I'm just sad to see a model like the recent Wan2.1 arrive with the slogan "consumer friendly" while needing a minimum of 40GB of VRAM to run the 14B model, while at the same time the 1.3B isn't more efficient or flexible than the LTX model itself. It's a mix of competition and hype where one model overshadows another before its potential is exploited. I remember the era of demos on the first personal computers, where the goal was to push the limits of the hardware. Currently, the only limit being pushed is the amount of compute needed to run the models. And I'm talking about the pure models: distillation and quantization have their limits, especially for video. The impact of quantization on LTXV is clear when you use LoRAs.
Oh! I will read this paper (Towards Precise Scaling Laws for Video Diffusion Transformers) once or twice; it seems very interesting at first sight. I've read the Image Conditioning section, but I must admit my understanding is limited at this level. If I understand the idea roughly, the principle is the equivalent of a keyframe, a bit like in 3D animation software? Anyway, all enhancements are welcome. However, I read somewhere (I don't remember where) that Lightricks was going to release a version 1.0 of the LTXV model and that it will change the model significantly, which is why few people have started training (finetuning/LoRA). Maybe you are aware of this? Anyway, no pressure, I am far from finished building the ideal corpus for this model. I still have several weeks of work ahead, in particular to refine the prompts: I use the VideoLlama3-7B model with a specialized system prompt. Unfortunately, the results are sometimes uneven, and it is almost impossible to obtain an "a-la-LTXV" prompt directly from this model. So I apply an iterative method: I generate several prompts (3-5 for now) with the vision model, and these prompts are then fed into a heavier, more capable model (a pure LLM; I still have to experiment with the choice of the final model) to produce a single refined prompt following the LTXV rules.
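To give an idea of the structure, the iterative loop looks roughly like the sketch below. The two wrappers, caption_video_with_vlm and consolidate_with_llm, are hypothetical placeholders standing in for the VideoLLaMA3-7B call and the final text LLM; they are not real APIs and need to be filled in with actual inference code.

```python
from pathlib import Path

def caption_video_with_vlm(video_path: Path, system_prompt: str) -> str:
    # Hypothetical: run VideoLLaMA3-7B on the clip with a specialized system prompt.
    raise NotImplementedError("plug in the vision-language model inference here")

def consolidate_with_llm(candidates: list[str], rules: str) -> str:
    # Hypothetical: ask a stronger pure-text LLM to merge the candidates
    # into one prompt that follows the LTXV prompting rules.
    raise NotImplementedError("plug in the text LLM inference here")

VLM_SYSTEM_PROMPT = "Describe the video in detail: subject, motion, camera, lighting."
LTXV_RULES = "Write a single dense paragraph in the LTX-Video prompt style."
NUM_CANDIDATES = 4   # 3-5 candidate captions per clip, as described above

def build_prompt(video_path: Path) -> str:
    # Stage 1: several noisy candidate captions from the vision-language model.
    candidates = [
        caption_video_with_vlm(video_path, VLM_SYSTEM_PROMPT)
        for _ in range(NUM_CANDIDATES)
    ]
    # Stage 2: one refined, LTXV-style prompt from the pure LLM.
    return consolidate_with_llm(candidates, LTXV_RULES)

for clip in Path("corpus/all_clips").glob("*.mp4"):
    clip.with_suffix(".txt").write_text(build_prompt(clip))
```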
So, in your view, the model is able to generalize beyond a single resolution? From experience, though, I can tell you that if I generate portrait videos using Lightricks' example prompts intended for landscape, the portrait video produced looks like a truncated version of a landscape video (a cut without reframing, with characters out of frame). Whether this is a model problem or simply a lack of training data, I don't know, but I can assure you the problem is very real. This is why I intended to start with two formats (portrait/landscape) at the same resolution. In any case, the videos I have exist in both formats; if I eliminate one format or the other, I lose part of the corpus, knowing that I am very demanding about the choice of videos and the final selection.
Very interesting. So I think I will run some experiments before doing a full training. But consider that 1000+ videos, or even 10,000, is just under 28 hours of footage (for 10-second videos). I had thought I might have to push to a hundred hours (or about 300k videos, maybe more, 500k?), which is why I asked the question. The problem lies mainly in balancing the choice of videos: having several examples of each type of video, with different angles in the case of a walking person, for example (a static camera that follows the movement, a moving camera, rotation around the character). There are many scenarios, and the idea is to provide a corpus that is solid, varied and balanced. As a result, the number of short (10-second) videos can climb quickly, especially since my sources are longer, from at least 30 seconds to several minutes, with different angles, which sometimes allows me to extract several sequences from the same source video. In terms of anatomy, diversity of gender, ethnicity, hair color, clothing... I would like a complete palette, hoping this confirms the idea that the training data counts as much as the model itself.
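As an aside, extracting several 10-second clips from each longer source can be scripted. A minimal sketch with ffmpeg follows; the paths, the stride between clip starts, the target resolution, and the crop/scale strategy are assumptions to adapt to the real corpus:

```python
import subprocess
from pathlib import Path

SOURCE = Path("sources/walk_scene_01.mp4")   # assumed: a 30 s to several-minute source
OUT_DIR = Path("corpus/all_clips")
CLIP_SECONDS = 10
STRIDE_SECONDS = 15                          # gap between clip starts, to vary content

def duration_seconds(path: Path) -> float:
    """Total duration of the source, read with ffprobe."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "csv=p=0", str(path)],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return float(out)

OUT_DIR.mkdir(parents=True, exist_ok=True)
total = duration_seconds(SOURCE)
start, index = 0.0, 0
while start + CLIP_SECONDS <= total:
    out_path = OUT_DIR / f"{SOURCE.stem}_{index:03d}.mp4"
    subprocess.run(
        ["ffmpeg", "-y", "-ss", f"{start:.2f}", "-i", str(SOURCE),
         "-t", str(CLIP_SECONDS),
         # Scale to cover 512x896, then center-crop, so the aspect ratio is preserved.
         "-vf", "scale=512:896:force_original_aspect_ratio=increase,crop=512:896",
         "-c:v", "libx264", "-an", str(out_path)],
        check=True,
    )
    start += STRIDE_SECONDS
    index += 1
```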
I understand. I think there is no miracle recipe and we will have to experiment. I had the same situation with a TTS model I trained for French: the problem was that it is difficult to judge convergence when it no longer shows up as a classic loss curve (the loss just oscillates within a range for hundreds of thousands of steps). Only the validation steps can tell you where the training has got to. I must admit this kind of graph does not speak to me. In any case, thank you for all the information; I will reread all of this once more to make sure I understand everything, and I will keep you informed of the progress of this project.
Video models are coming out one after another (most recently Wan2.1), but they are becoming increasingly heavy and impossible to use on local machines with a decent consumer GPU. I understand that this corresponds to research, and that academics often have heavy training resources, but ultimately, who is the generation of short, few-second videos intended for?
I have used the LTX Video model quite a bit, and despite all its shortcomings it is the only model that is pleasant to use. It can produce 10-second videos at 512x896 (roughly a quarter of 4K UHD in each dimension), respecting the requirement that both dimensions be divisible by 32, on my small 16GB GPU, and in about one minute. And it is not so much the sampling that consumes VRAM as the VAE decoding. Hugging Face now even offers a free online decoding service.
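For reference, a setup along these lines can be sketched with diffusers. This is a minimal sketch, assuming the LTXPipeline API and the Lightricks/LTX-Video checkpoint; the prompt and step count are placeholders, and memory savers such as CPU offload and tiled VAE decoding may vary with the diffusers version:

```python
import torch
from diffusers import LTXPipeline
from diffusers.utils import export_to_video

pipe = LTXPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()   # keeps peak VRAM within a 16 GB card
pipe.vae.enable_tiling()          # tiled decode for the VAE, the main VRAM consumer

video = pipe(
    prompt="A woman walks along a rainy street at night, neon reflections on the wet asphalt.",
    width=512,                    # both dimensions divisible by 32
    height=896,
    num_frames=241,               # ~10 s at 24 fps; num_frames of the form 8*k + 1
    num_inference_steps=40,
).frames[0]

export_to_video(video, "output.mp4", fps=24)
```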
There are some issues with the LTX Video model:
That is why I intend to try fine-tuning the model with finetrainers.
My goal is to finalize my video corpus whose overall characteristics are:
My question is rather simple. I have access to hundreds of hours of UHD videos, and I am sorting them and handling the prompting with the best current multimodal models.
In your opinion, what is the ideal quantity for this type of finetuning? If there is an ideal quantity, of course!