[LTX Video] Full finetuning #272
Thanks for the detailed writeup!
I'd like to quote one of my favourite YouTube channels, Two Minute Papers, and say that we shouldn't be looking at what's available now, but instead imagine what will be possible two papers down the line. Sure, it's harder to run at the moment, but people will figure out ways to cleverly offload, reduce the computation required, distill, and apply loads of clever tricks to make local inference faster. There is high demand for such models, so it's only a matter of time before someone releases consumer-friendly models competitive with the closed-source state of the art. The same happened in the LLM field, and now we have people running better models than the OG ChatGPT world-killer-robots-going-to-take-over-my-job-oh-god-im-losing-my-mind-wtf model on a local 2x Mac mini setup with decent decoding speed.
This is actually a hard question to answer without properly studying the scaling laws for smaller video diffusion models like LTXVideo. To be quite honest, I can't provide an answer that would definitely be correct. Instead, I would recommend doing some hyperparameter ablations of your own and going through this paper for some good starting points. One thing I've convincingly found to work better is the LTX first-frame image conditioning (the current finetrainers v0.0.1 release does not support it, but #245 does), so if I were you, I'd definitely dig into that a bit.
From my experiments, I have not found any evidence that multi-resolution training helps if you want to generate videos at a fixed resolution/frame count. So, if you're looking to finetune for a specific use case, I'd suggest training on same-resolution/same-length videos in bulk. A decent starting point would be ~100-200 videos: see how much you can overfit without the model completely collapsing. Once you've gotten it to overfit on about 30-50% of them within a low number of training steps (it doesn't have to reproduce the exact videos, just be able to replicate parts of them), you can scale your dataset to 1000+ videos with the same hyperparameters.
I haven't done any large-scale training yet (> 1M images/videos), so I can't really answer what the optimal hyperparameter choices are for a given dataset size, if that's what you're going for -- I will eventually be able to, after making our training scalable and maximizing our FLOP utilization as much as I know how to. There isn't really an incentive for us to do training at scale yet (maybe other than learning how to?) without improving the infra first (since others are going to pick the low-hanging fruit anyway), but I'll try to write some guides about the learnings along the way.
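To make that overfitting sanity check concrete, here is a minimal sketch (plain Python plus ffprobe, not finetrainers code; the folder layout, the sidecar .txt caption files, and the 150-video target are assumptions to adapt) for pulling a fixed-resolution subset out of a larger corpus:

```python
import random
import shutil
import subprocess
from pathlib import Path

SOURCE_DIR = Path("corpus/all_clips")        # assumed layout: one .mp4 per clip
SUBSET_DIR = Path("corpus/overfit_subset")   # subset used for the overfit test
TARGET_WH = (512, 896)                       # fixed width x height, divisible by 32
SUBSET_SIZE = 150                            # ~100-200 videos, as suggested above

def video_size(path: Path) -> tuple[int, int]:
    """Read width/height of the first video stream with ffprobe."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "v:0",
         "-show_entries", "stream=width,height", "-of", "csv=p=0", str(path)],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    w, h = (int(x) for x in out.split(","))
    return w, h

# Keep only clips that already match the target training resolution.
candidates = [p for p in SOURCE_DIR.glob("*.mp4") if video_size(p) == TARGET_WH]
subset = random.sample(candidates, min(SUBSET_SIZE, len(candidates)))

SUBSET_DIR.mkdir(parents=True, exist_ok=True)
for clip in subset:
    shutil.copy2(clip, SUBSET_DIR / clip.name)
    caption = clip.with_suffix(".txt")       # assumed: caption stored next to the clip
    if caption.exists():
        shutil.copy2(caption, SUBSET_DIR / caption.name)

print(f"{len(subset)} clips at {TARGET_WH} copied to {SUBSET_DIR}")
```

Train on that subset until parts of the clips are clearly reproducible, then reuse the same hyperparameters on the 1000+ video set.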
I agree, of course. I'm just sad to see a model like the recent Wan2.1 arrive with the slogan "consumer friendly" while needing a minimum of 40GB of VRAM to run the 14B model, while at the same time the 1.3B isn't more efficient or flexible than the LTX model itself. It's a mix of competition and hype where one model overshadows another before its potential is exploited. I remember the era of demos on the first personal computers, where the goal was to push the limits of the hardware. Currently, the only limit being pushed is the amount of compute needed to run the models. And I'm talking about the pure models: distillation and quantization have their limits, especially for video. The impact of quantization on LTXV is clear when you use LoRAs.
Oh! I will read this paper (Towards Precise Scaling Laws for Video Diffusion Transformers) once or twice; it seems very interesting at first sight. I've read the Image Conditioning section, but I must admit my understanding is limited at this level. If I understand the idea roughly, the principle is the equivalent of a keyframe, a bit like in 3D animation software? Anyway, all enhancements are welcome. However, I read somewhere (I don't remember where) that Lightricks was going to release a version 1.0 of the LTXV model and that it will change the model significantly, which is why few people have started training (finetuning/LoRA). Maybe you are aware of this? Anyway, no pressure, I am far from finished building the ideal corpus for this model. I still have several weeks of work ahead, in particular to refine the prompts: I use the VideoLlama3-7B model with a specialized system prompt. Unfortunately, the results are sometimes uneven, and it is almost impossible to obtain an "a-la-LTXV" prompt directly from this model. So I apply an iterative method: I generate several prompts (3-5 for now) with the vision model, and these prompts are then fed into a heavier, more capable model (a pure LLM; I still have to experiment with the choice of the final model) to produce a single refined prompt following the LTXV rules.
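To give an idea of the structure, the iterative loop looks roughly like the sketch below. The two wrappers, caption_video_with_vlm and consolidate_with_llm, are hypothetical placeholders standing in for the VideoLLaMA3-7B call and the final text LLM; they are not real APIs and need to be filled in with actual inference code.

```python
from pathlib import Path

def caption_video_with_vlm(video_path: Path, system_prompt: str) -> str:
    # Hypothetical: run VideoLLaMA3-7B on the clip with a specialized system prompt.
    raise NotImplementedError("plug in the vision-language model inference here")

def consolidate_with_llm(candidates: list[str], rules: str) -> str:
    # Hypothetical: ask a stronger pure-text LLM to merge the candidates
    # into one prompt that follows the LTXV prompting rules.
    raise NotImplementedError("plug in the text LLM inference here")

VLM_SYSTEM_PROMPT = "Describe the video in detail: subject, motion, camera, lighting."
LTXV_RULES = "Write a single dense paragraph in the LTX-Video prompt style."
NUM_CANDIDATES = 4   # 3-5 candidate captions per clip, as described above

def build_prompt(video_path: Path) -> str:
    # Stage 1: several noisy candidate captions from the vision-language model.
    candidates = [
        caption_video_with_vlm(video_path, VLM_SYSTEM_PROMPT)
        for _ in range(NUM_CANDIDATES)
    ]
    # Stage 2: one refined, LTXV-style prompt from the pure LLM.
    return consolidate_with_llm(candidates, LTXV_RULES)

for clip in Path("corpus/all_clips").glob("*.mp4"):
    clip.with_suffix(".txt").write_text(build_prompt(clip))
```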
So, in your view, the model is able to generalize beyond a single resolution? From experience, though, I can tell you that if I generate portrait videos using Lightricks' example prompts intended for landscape, the portrait video produced looks like a truncated version of a landscape video (a cut without reframing, with characters out of frame). Whether this is a model problem or simply a lack of training data, I don't know, but I can assure you the problem is very real. This is why I intended to start with two formats (portrait/landscape) at the same resolution. In any case, the videos I have exist in both formats; if I eliminate one format or the other, I lose part of the corpus, knowing that I am very demanding about the choice of videos and the final selection.
Very interesting. So I think I will run some experiments before doing a full training. But consider that 1000+ videos, or even 10,000, is just under 28 hours of footage (for 10-second videos). I had thought I might have to push to a hundred hours (or about 300k videos, maybe more, 500k?), which is why I asked the question. The problem lies mainly in balancing the choice of videos: having several examples of each type of video, with different angles in the case of a walking person, for example (a static camera that follows the movement, a moving camera, rotation around the character). There are many scenarios, and the idea is to provide a corpus that is solid, varied and balanced. As a result, the number of short (10-second) videos can climb quickly, especially since my sources are longer, from at least 30 seconds to several minutes, with different angles, which sometimes allows me to extract several sequences from the same source video. In terms of anatomy, diversity of gender, ethnicity, hair color, clothing... I would like a complete palette, hoping this confirms the idea that the training data counts as much as the model itself.
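As an aside, extracting several 10-second clips from each longer source can be scripted. A minimal sketch with ffmpeg follows; the paths, the stride between clip starts, the target resolution, and the crop/scale strategy are assumptions to adapt to the real corpus:

```python
import subprocess
from pathlib import Path

SOURCE = Path("sources/walk_scene_01.mp4")   # assumed: a 30 s to several-minute source
OUT_DIR = Path("corpus/all_clips")
CLIP_SECONDS = 10
STRIDE_SECONDS = 15                          # gap between clip starts, to vary content

def duration_seconds(path: Path) -> float:
    """Total duration of the source, read with ffprobe."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "csv=p=0", str(path)],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return float(out)

OUT_DIR.mkdir(parents=True, exist_ok=True)
total = duration_seconds(SOURCE)
start, index = 0.0, 0
while start + CLIP_SECONDS <= total:
    out_path = OUT_DIR / f"{SOURCE.stem}_{index:03d}.mp4"
    subprocess.run(
        ["ffmpeg", "-y", "-ss", f"{start:.2f}", "-i", str(SOURCE),
         "-t", str(CLIP_SECONDS),
         # Scale to cover 512x896, then center-crop, so the aspect ratio is preserved.
         "-vf", "scale=512:896:force_original_aspect_ratio=increase,crop=512:896",
         "-c:v", "libx264", "-an", str(out_path)],
        check=True,
    )
    start += STRIDE_SECONDS
    index += 1
```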
I understand. I think there is no miracle recipe and we will have to experiment. I had the same situation with a TTS model I trained for French: the problem was that it is difficult to judge convergence when it no longer shows up as a classic loss curve (the loss just oscillates within a range for hundreds of thousands of steps). Only the validation steps can tell you where the training has got to. I must admit this kind of graph does not speak to me. In any case, thank you for all the information; I will reread all of this once more to make sure I understand everything, and I will keep you informed of the progress of this project.
Video models are coming out one after another (most recently Wan2.1), but they are becoming increasingly heavy and impossible to use on local machines with a decent consumer GPU. I understand that this corresponds to research, and that academics often have heavy training resources, but ultimately, who is the generation of short, few-second videos intended for?
I have used the LTX Video model quite a bit, and despite all its shortcomings it is the only model that is pleasant to use. It can produce 10-second videos at 512x896 (roughly a quarter of 4K UHD in each dimension), respecting the requirement that both dimensions be divisible by 32, on my small 16GB GPU, and in about one minute. And it is not so much the sampling that consumes VRAM as the VAE decoding. Hugging Face now even offers a free online decoding service.
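For reference, a setup along these lines can be sketched with diffusers. This is a minimal sketch, assuming the LTXPipeline API and the Lightricks/LTX-Video checkpoint; the prompt and step count are placeholders, and memory savers such as CPU offload and tiled VAE decoding may vary with the diffusers version:

```python
import torch
from diffusers import LTXPipeline
from diffusers.utils import export_to_video

pipe = LTXPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()   # keeps peak VRAM within a 16 GB card
pipe.vae.enable_tiling()          # tiled decode for the VAE, the main VRAM consumer

video = pipe(
    prompt="A woman walks along a rainy street at night, neon reflections on the wet asphalt.",
    width=512,                    # both dimensions divisible by 32
    height=896,
    num_frames=241,               # ~10 s at 24 fps; num_frames of the form 8*k + 1
    num_inference_steps=40,
).frames[0]

export_to_video(video, "output.mp4", fps=24)
```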
There are some issues with the LTX Video model:
That is why I intend to try fine-tuning the model with finetrainers.
My goal is to finalize my video corpus whose overall characteristics are:
My question is rather simple. I have access to hundreds of hours of UHD videos, and I am sorting them and handling the prompting with the best current multimodal models.
In your opinion, what is the ideal quantity for this type of finetuning? If there is an ideal quantity, of course!