diffusers quantization training blog draft #2888
Conversation
Thanks for getting started on this one. The early draft looks really well structured.
Left some early comments. Let me know if they make sense.
diffusers-quantization2.md
Outdated
Now, we tackle **efficiently *fine-tuning* these models.** This post will guide you through fine-tuning FLUX.1-dev using QLoRA with the Hugging Face `diffusers` library. We'll showcase results from an NVIDIA RTX 4090.
## Why Not Just Full Fine-Tuning?
Instead of making bulleted points, I think we could do it in short paragraphs to convey the main points.
I agree on this. Also, specifically here, I think it's intuitive why we don't fully fine-tune, so a dedicated title isn't necessary here imo.
diffusers-quantization2.md
Outdated
* **Pros:** Massively reduces trainable parameters, saving VRAM during training and resulting in small adapter checkpoints.
* **Cons (for base model memory):** The full-precision base model still needs to be loaded, which, for FLUX.1-dev, is still a hefty VRAM requirement even if fewer parameters are being updated.
**QLoRA: The Efficiency Powerhouse:** QLoRA enhances LoRA by:
Consider providing the QLoRA paper link as a reference.
diffusers-quantization2.md
Outdated
The fine-tuned model nicely captured Mucha's iconic art nouveau style, evident in the decorative motifs and distinct color palette. The QLoRA process maintained excellent fidelity while learning the new style.
**Colab Adaptability:**
<!-- [add a section talking about / change above to be focused on running in google colab] -->
We should also shed light on:
- https://github.com/huggingface/diffusers/tree/main/examples/research_projects/flux_lora_quantization#inference
- Mention https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/README_hidream.md#using-quantization
- Showcase `torchao` FP8 training, too: https://github.com/sayakpaul/diffusers-torchao/tree/main/training
Really nice draft! Really promising! Left a bunch of comments.
diffusers-quantization2.md
Outdated
Now, we tackle **efficiently *fine-tuning* these models.** This post will guide you through fine-tuning FLUX.1-dev using QLoRA with the Hugging Face `diffusers` library. We'll showcase results from an NVIDIA RTX 4090.
Let's emphasize the VRAM that the user needs!
diffusers-quantization2.md
Outdated
**LoRA (Low-Rank Adaptation):** LoRA freezes the pre-trained weights and injects small, trainable "adapter" layers.
* **Pros:** Massively reduces trainable parameters, saving VRAM during training and resulting in small adapter checkpoints.
* **Cons (for base model memory):** The full-precision base model still needs to be loaded, which, for FLUX.1-dev, is still a hefty VRAM requirement even if fewer parameters are being updated.

**QLoRA: The Efficiency Powerhouse:** QLoRA enhances LoRA by:
1. Loading the pre-trained base model in a quantized format (typically 4-bit via `bitsandbytes`), drastically cutting the base model's memory footprint.
2. Training LoRA adapters (usually in FP16/BF16) on top of this quantized base.
Let's add some links to relevant docs if they want more information on LoRA and QLoRA. Adding images showing how LoRA and QLoRA work would also be nice.
This one is quite nice: https://huggingface.co/docs/peft/main/en/developer_guides/lora#merge-lora-weights-into-the-base-model
diffusers-quantization2.md
Outdated
We aimed to fine-tune `black-forest-labs/FLUX.1-dev` to adopt the artistic style of Alphonse Mucha, using a small [dataset](https://huggingface.co/datasets/derekl35/alphonse-mucha-style).
<!-- (maybe use different dataset) -->
if we get more datasets, we can just publish a collection with multiple loras
Could include the LoRA I used in https://github.com/huggingface/diffusers/tree/main/examples/research_projects/flux_lora_quantization.
Maybe @linoytsaban can share a few more interesting datasets.
sure, here are a few:
diffusers-quantization2.md
Outdated
**LoRA (Low-Rank Adaptation) Deep Dive:**
LoRA works by decomposing weight updates into low-rank matrices. Instead of updating the full weight matrix $$W$$, LoRA learns two smaller matrices $$A$$ and $$B$$ such that the update is $$\Delta W = BA$$, where $$A \in \mathbb{R}^{r \times k}$$ and $$B \in \mathbb{R}^{d \times r}$$. The rank $$r$$ is typically much smaller than the original dimensions, drastically reducing trainable parameters. LoRA $$\alpha$$ is a scaling factor for the LoRA activations, often set to the same value as $$r$$ or a multiple of it. It helps balance the influence of the pre-trained model and the LoRA adapter.

**8-bit Optimizer (AdamW):**
The standard AdamW optimizer maintains first and second moment estimates for each parameter in FP32, consuming significant memory. The 8-bit AdamW uses block-wise quantization to store optimizer states in 8-bit precision while maintaining training stability. This technique can reduce optimizer memory usage by ~75% compared to standard FP32 AdamW.

**Gradient Checkpointing:**
During the forward pass, intermediate activations are typically stored for backward-pass gradient computation. Gradient checkpointing trades computation for memory by only storing certain "checkpoint" activations and recomputing others during backpropagation.
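To make these concrete, here's a minimal sketch combining the three techniques (illustrative values only, not the draft's exact settings; `transformer` is assumed to be the FLUX transformer loaded as in the NF4 snippet further below):

```python
import bitsandbytes as bnb
from peft import LoraConfig

# LoRA: inject low-rank adapters; only these become trainable.
transformer_lora_config = LoraConfig(
    r=16,                       # rank r, much smaller than the original dims
    lora_alpha=16,              # scaling factor for the LoRA activations
    init_lora_weights="gaussian",
    target_modules=["to_k", "to_q", "to_v", "to_out.0"],
)
transformer.add_adapter(transformer_lora_config)

# 8-bit AdamW: optimizer states kept in block-wise quantized 8-bit buffers.
trainable_params = [p for p in transformer.parameters() if p.requires_grad]
optimizer = bnb.optim.AdamW8bit(trainable_params, lr=1e-4)

# Gradient checkpointing: recompute intermediate activations during backprop
# instead of storing all of them.
transformer.enable_gradient_checkpointing()
```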
Didn't see that you added this here. Maybe add the image here + add links.
```python
# Determine compute dtype based on mixed precision
bnb_4bit_compute_dtype = torch.float32
if args.mixed_precision == "fp16":
    bnb_4bit_compute_dtype = torch.float16
elif args.mixed_precision == "bf16":
    bnb_4bit_compute_dtype = torch.bfloat16

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=bnb_4bit_compute_dtype,
)

transformer = FluxTransformer2DModel.from_pretrained(
    args.pretrained_model_name_or_path,
    subfolder="transformer",
    quantization_config=nf4_config,
    torch_dtype=bnb_4bit_compute_dtype,
)
```
Since the model weights are in bfloat16, let's do the training in that specific dtype, no? cc @sayakpaul
Yeah. `bfloat16` is preferred. On Colab (T4), FP16 is needed as BF16 makes it cry.
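One way to handle this automatically (an illustrative sketch, not taken from the draft) is to probe for BF16 support and fall back to FP16 on GPUs like the T4:

```python
import torch

# Prefer bfloat16 where the GPU supports it (e.g. RTX 4090, A100);
# fall back to float16 on older cards such as the Colab T4.
compute_dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
```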
diffusers-quantization2.md
Outdated
<!-- maybe explain cache latents -->
Would be nice indeed!
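For reference, a rough sketch of the latent-caching idea (illustrative only; `vae` and `train_dataloader` are assumed to already exist, and the names differ from the actual training script):

```python
import torch

# Encode every training image once with the frozen VAE, keep the latent
# distributions, then drop the VAE to free VRAM for the rest of training.
latents_cache = []
vae.eval()
with torch.no_grad():
    for batch in train_dataloader:
        pixel_values = batch["pixel_values"].to(device=vae.device, dtype=vae.dtype)
        latents_cache.append(vae.encode(pixel_values).latent_dist)

del vae
torch.cuda.empty_cache()
```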
diffusers-quantization2.md
Outdated
**Colab Adaptability:**
<!-- [add a section talking about / change above to be focused on running in google colab] -->
Colab would be nice so that users feel like it is easier to reproduce what you did.
It would also be nice to have a command to run the script directly with this specific dataset. Feel free to modify the `train_dreambooth_lora_flux_miniature.py` script to remove parts that are not needed and upload the new script to the diffusers repo.
However, T4 Colab would be terribly slow.
diffusers-quantization2.md
Outdated
<!-- [Maybe add a link to trained LoRA adapter on Hugging Face Hub.] -->
would be nice to create a collection of adapters that you trained using this script
diffusers-quantization2.md
Outdated
base model:


QLoRA fine-tuned:

*Prompts: (left to right)*
We need to do a section about inference after we trained the LoRAs. Of course you can just load back the LoRA, but it would be nice to let the user know that he can also merge the LoRAs into the base model for efficient inference. cc @sayakpaul do ppl actually merge the loras or not in practice?
> do ppl actually merge the loras or not in practice?
It depends on use-cases and trade-offs. Sometimes people prefer merging to save VRAM, sometimes they don't to be able to experiment with different LoRAs.
Makes sense! Let's clarify this in the blog post.
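Something along these lines could work for that inference section (a sketch; the LoRA repo id is a placeholder):

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# Option 1: keep the adapter separate, so different LoRAs can be swapped freely.
pipe.load_lora_weights("your-username/alphonse-mucha-lora")  # placeholder repo id
image = pipe(
    "A serene woman with flowing hair, art nouveau style",
    num_inference_steps=28,
).images[0]

# Option 2: merge (fuse) the LoRA into the base weights, removing the extra
# adapter computation at each denoising step.
pipe.fuse_lora()
```

Fusing trades flexibility (easy swapping of adapters) for a small speed and memory win, which matches the trade-off described above.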
diffusers-quantization2.md
Outdated
@@ -0,0 +1,130 @@
# Fine-Tuning FLUX.1-dev with QLoRA |
Maybe specify that users only need consumer hardware?
Fine-tuning FLUX.1-dev on consumer hardware with QLoRA
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
I left very general comments, very nicely put blog!
diffusers-quantization2.md
Outdated
## Conclusion

QLoRA, coupled with the `diffusers` library, significantly democratizes the ability to customize state-of-the-art models like FLUX.1-dev. As demonstrated on an RTX 4090, efficient fine-tuning is well within reach, yielding high-quality stylistic adaptations. Importantly, these techniques are adaptable, paving the way for users on more constrained hardware, like Google Colab, to also participate.
A call-to-action would be nice so that the blog converts to more models on the Hub 🙏🏻
diffusers-quantization2.md
Outdated
In our previous post, [Exploring Quantization Backends in Diffusers](https://huggingface.co/blog/diffusers-quantization), we dived into how various quantization techniques can shrink diffusion models like FLUX.1-dev, making them significantly more accessible for *inference* without drastically compromising performance. We saw how `bitsandbytes`, `torchao`, and others reduce memory footprints for generating images.

Performing inference is cool but to make these models truly our own, we also need to be able to fine-tune them. Therefore, in this post, we tackle **efficient** *fine-tuning* of these models with peak memory use under ~10 GB of VRAM on a single GPU. This post will guide you through fine-tuning FLUX.1-dev using QLoRA with the Hugging Face `diffusers` library. We'll showcase results from an NVIDIA RTX 4090.
ToC would be nice for those who know basics and want to skip to the gist of it
```python
)
transformer.add_adapter(transformer_lora_config)
```
Only these LoRA parameters become trainable.
You could put the nice printout that shows the number of trainable params vs total params here when you add the adapter, imo :)
added total parameters and trainable parameters, not sure if that's what you meant
In PEFT, when you call `print_trainable_parameters()`, something like this appears:
`trainable params: 667493 || all params: 86466149 || trainable%: 0.77`
I find it pretty cool :)
Ohh. This one isn’t a PEFT model though, it’s a diffusers model, so I had to use num_parameters() to get the total and trainable counts. Figured that’d give a similar overview!
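For reference, a small sketch of that approach on the `diffusers` model (assuming the LoRA adapter has already been added to `transformer`):

```python
# diffusers models expose num_parameters(); count all vs. trainable parameters.
total_params = transformer.num_parameters()
trainable_params = transformer.num_parameters(only_trainable=True)
print(
    f"trainable params: {trainable_params:,} || all params: {total_params:,} "
    f"|| trainable%: {100 * trainable_params / total_params:.2f}"
)
```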
Co-authored-by: Merve Noyan <merveenoyan@gmail.com>
diffusers-quantization2.md
Outdated
The model consists of three main components:

* **Text Encoders (CLIP and T5):**
  * **Function:** Process input text prompts. FLUX-dev uses CLIP for initial understanding and a larger T5 for nuanced comprehension and better text rendering.
No need to state "Function" IMHO, it seems a bit cluttered.
diffusers-quantization2.md
Outdated
Now, we tackle **efficiently *fine-tuning* these models.** This post will guide you through fine-tuning FLUX.1-dev using QLoRA with the Hugging Face `diffusers` library. We'll showcase results from an NVIDIA RTX 4090.

## Why Not Just Full Fine-Tuning?
I agree on this. Also, specifically here, I think it's intuitive why we don't fully fine-tune, so a dedicated title isn't necessary here imo.
diffusers-quantization2.md
Outdated
**Full Fine-Tuning:** This traditional method updates all model params and offers the potential for the highest task-specific quality. However, for FLUX.1-dev, this approach would demand immense VRAM (multiple high-end GPUs), putting it out of reach for most individual users.

**LoRA (Low-Rank Adaptation):** [LoRA](https://huggingface.co/docs/peft/main/en/conceptual_guides/lora) freezes the pre-trained weights and injects small, trainable "adapter" layers. This massively reduces trainable parameters, saving VRAM during training and resulting in small adapter checkpoints. The challenge is that the full-precision base model still needs to be loaded, which, for FLUX.1-dev, remains a hefty VRAM requirement even if fewer parameters are being updated.
Could be nice to reference previous blogs touching on this, e.g. https://huggingface.co/blog/lora
For a more detailed guide and code snippets, please refer to [this gist](https://gist.github.com/sayakpaul/f0358dd4f4bcedf14211eba5704df25a) and the [`diffusers-torchao` repository](https://github.com/sayakpaul/diffusers-torchao/tree/main/training).

## Inference with Trained LoRA Adapters
For the purpose of the blog I'd make the repo with the Alphonse Mucha LoRA public, with code examples for the two inference approaches. Also, it would be nice to use the Gallery component.
diffusers-quantization2.md
Outdated
If you've trained a LoRA for FLUX.1-dev, we encourage you to share it. Here's how you can do it:
- Follow this guide on [sharing models on the Hub](https://huggingface.co/docs/transformers/en/model_sharing).
- Add `flux` and `lora` as tags in your model card's metadata to make it easily discoverable.
I think having a code snippet here to push a LoRA to the Hub will make it easier for people to push their model + we can use the model card template in the flux lora training script that already contains tags and such.
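A possible sketch for such a snippet (assuming the trained adapter was saved locally to `alphonse-mucha-lora/`, a placeholder path and repo name):

```python
from huggingface_hub import create_repo, upload_folder

# Create (or reuse) a repo on the Hub and upload the saved LoRA folder,
# including the model card with `flux` / `lora` tags if one was generated.
repo_id = create_repo("your-username/alphonse-mucha-lora", exist_ok=True).repo_id
upload_folder(
    repo_id=repo_id,
    folder_path="alphonse-mucha-lora",
    commit_message="Upload Alphonse Mucha FLUX QLoRA adapter",
)
```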
Preparing the Article
You're not quite done yet, though. Please make sure to follow this process (as documented here):
`md` file. You can also specify `guest` or `org` for the authors.
@SunMarc