
diffusers quantization training blog draft #2888


Open · wants to merge 13 commits into base: main

Conversation

DerekLiu35 (Contributor)

Preparing the Article

You're not quite done yet, though. Please make sure to follow this process (as documented here):

  • Add an entry to _blog.yml.
  • Add a thumbnail. There are no requirements here, but there is a template if it's helpful.
  • Check you use a short title and blog path.
  • Upload any additional assets (such as images) to the Documentation Images repo. This is to reduce bloat in the GitHub base repo when cloning and pulling. Try to have small images to avoid a slow or expensive user experience.
  • Add metadata (such as authors) to your md file. You can also specify guest or org for the authors.
  • Ensure the publication date is correct.
  • Preview the content. A quick way is to paste the markdown content in https://huggingface.co/new-blog. Do not click publish, this is just a way to do an early check.


@SunMarc SunMarc requested a review from sayakpaul June 4, 2025 11:12
@sayakpaul (Member) left a comment

Thanks for getting started on this one. The early draft looks really well structured.

Left some early comments. Let me know if they make sense.


Now, we tackle **efficiently *fine-tuning* these models.** This post will guide you through fine-tuning FLUX.1-dev using QLoRA with the Hugging Face `diffusers` library. We'll showcase results from an NVIDIA RTX 4090.

## Why Not Just Full Fine-Tuning?
Member:

Instead of making bulleted points, I think we could do it in short paragraphs to convey the main points.

Contributor:

I agree with this. Also, specifically here, I think it's intuitive why full fine-tuning isn't an option, so a dedicated title isn't necessary here imo.

* **Pros:** Massively reduces trainable parameters, saving VRAM during training and resulting in small adapter checkpoints.
* **Cons (for base model memory):** The full-precision base model still needs to be loaded, which, for FLUX.1-dev, is still a hefty VRAM requirement even if fewer parameters are being updated.

**QLoRA: The Efficiency Powerhouse:** QLoRA enhances LoRA by:
Member:

Consider providing the QLoRA paper link as a reference.

The fine-tuned model nicely captured Mucha's iconic art nouveau style, evident in the decorative motifs and distinct color palette. The QLoRA process maintained excellent fidelity while learning the new style.

**Colab Adaptability:**
<!-- [add a section talking about / change above to be focused on running in google colab] -->

@SunMarc (Member) left a comment

Really nice draft! Really promising! Left a bunch of comments.

Comment on lines 5 to 6
Now, we tackle **efficiently *fine-tuning* these models.** This post will guide you through fine-tuning FLUX.1-dev using QLoRA with the Hugging Face `diffusers` library. We'll showcase results from an NVIDIA RTX 4090.

Member:

Let's emphasize the VRAM that the user needs!

Comment on lines 15 to 21
**LoRA (Low-Rank Adaptation):** LoRA freezes the pre-trained weights and injects small, trainable "adapter" layers.
* **Pros:** Massively reduces trainable parameters, saving VRAM during training and resulting in small adapter checkpoints.
* **Cons (for base model memory):** The full-precision base model still needs to be loaded, which, for FLUX.1-dev, is still a hefty VRAM requirement even if fewer parameters are being updated.

**QLoRA: The Efficiency Powerhouse:** QLoRA enhances LoRA by:
1. Loading the pre-trained base model in a quantized format (typically 4-bit via `bitsandbytes`), drastically cutting the base model's memory footprint.
2. Training LoRA adapters (usually in FP16/BF16) on top of this quantized base.
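
To make the two-step recipe concrete, here is a minimal sketch of step 2, attaching LoRA adapters on top of the quantized base. It assumes the 4-bit `transformer` loaded in the snippet further down; the rank, alpha, and `target_modules` values are illustrative placeholders, not the draft's final settings.

```python
# Illustrative sketch only: attach trainable LoRA adapters to the frozen,
# 4-bit-quantized FLUX transformer. Hyperparameters here are placeholders.
from peft import LoraConfig

transformer_lora_config = LoraConfig(
    r=16,                    # low-rank dimension of the A/B matrices
    lora_alpha=16,           # scaling factor for the LoRA update
    init_lora_weights="gaussian",
    target_modules=["to_k", "to_q", "to_v", "to_out.0"],  # attention projections
)
transformer.add_adapter(transformer_lora_config)  # only these adapters receive gradients
```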
Member:

Let's add some links to relevant docs if they want more information on LoRA and QLoRA. Adding images of how LoRA and QLoRA work would also be nice.
This one is quite nice: https://huggingface.co/docs/peft/main/en/developer_guides/lora#merge-lora-weights-into-the-base-model

Comment on lines 27 to 28
We aimed to fine-tune `black-forest-labs/FLUX.1-dev` to adopt the artistic style of Alphonse Mucha, using a small [dataset](https://huggingface.co/datasets/derekl35/alphonse-mucha-style).
<!-- (maybe use different dataset) -->
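
For readers who want to look at the data first, a quick hedged sketch of loading that dataset with 🤗 Datasets (the split name should be checked against the dataset card):

```python
# Sketch: inspect the Alphonse Mucha style dataset referenced above.
from datasets import load_dataset

dataset = load_dataset("derekl35/alphonse-mucha-style", split="train")
print(dataset)            # row count and column names
print(dataset[0].keys())  # fields of a single example
```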
Member:

If we get more datasets, we can just publish a collection with multiple LoRAs.

Member:

Could include the LoRA I used in https://github.com/huggingface/diffusers/tree/main/examples/research_projects/flux_lora_quantization.

Maybe @linoytsaban can share a few more interesting datasets.


Comment on lines 36 to 43
**LoRA (Low-Rank Adaptation) Deep Dive:**
LoRA works by decomposing weight updates into low-rank matrices. Instead of updating the full weight matrix $$W$$, LoRA learns two smaller matrices $$A$$ and $$B$$ such that the update is $$\Delta W = BA$$, where $$A \in \mathbb{R}^{r \times k}$$ and $$B \in \mathbb{R}^{d \times r}$$. The rank $$r$$ is typically much smaller than the original dimensions, drastically reducing trainable parameters. LoRA $$\alpha$$ is a scaling factor for the LoRA activations, often set to the same value as $$r$$ or a multiple of it. It helps balance the influence of the pre-trained model and the LoRA adapter.

**8-bit Optimizer (AdamW):**
The standard AdamW optimizer maintains first and second moment estimates for each parameter in FP32, consuming significant memory. The 8-bit AdamW optimizer uses block-wise quantization to store optimizer states in 8-bit precision while maintaining training stability. This technique can reduce optimizer memory usage by ~75% compared to standard FP32 AdamW.

**Gradient Checkpointing:**
During forward pass, intermediate activations are typically stored for backward pass gradient computation. Gradient checkpointing trades computation for memory by only storing certain "checkpoint" activations and recomputing others during backpropagation.
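
A hedged sketch of how these two savings are typically wired up with `bitsandbytes` and `diffusers`; it assumes the LoRA-equipped `transformer` from the other snippets, and the learning rate is a placeholder:

```python
# Sketch: 8-bit AdamW for the LoRA parameters + gradient checkpointing.
import bitsandbytes as bnb

trainable_params = [p for p in transformer.parameters() if p.requires_grad]

# Optimizer states are kept in 8-bit via block-wise quantization (~75% smaller).
optimizer = bnb.optim.AdamW8bit(trainable_params, lr=1e-4)

# Store only checkpointed activations and recompute the rest in the backward pass.
transformer.enable_gradient_checkpointing()
```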
Member:

Didn't see that you added this here. Maybe add the image here + add links.

Comment on lines +51 to +70
```python
import torch
from diffusers import BitsAndBytesConfig, FluxTransformer2DModel

# Determine compute dtype based on mixed precision
bnb_4bit_compute_dtype = torch.float32
if args.mixed_precision == "fp16":
    bnb_4bit_compute_dtype = torch.float16
elif args.mixed_precision == "bf16":
    bnb_4bit_compute_dtype = torch.bfloat16

# NF4 4-bit quantization config for the base transformer
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=bnb_4bit_compute_dtype,
)

# Load the FLUX transformer in 4-bit NF4
transformer = FluxTransformer2DModel.from_pretrained(
    args.pretrained_model_name_or_path,
    subfolder="transformer",
    quantization_config=nf4_config,
    torch_dtype=bnb_4bit_compute_dtype,
)
```
Member:

Since the model weights are in bfloat16, let's do the training in that specific dtype, no? cc @sayakpaul

Member:

Yeah. bfloat16 is preferred. On Colab (T4), FP16 is needed as BF16 makes it cry.

Comment on lines 44 to 46

<!-- maybe explain cache latents -->

Member:

Would be nice indeed!
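
For what that explanation could look like, a hedged sketch of latent caching (simplified; the actual script also applies the VAE's scaling and shift factors, and the dataloader/variable names here are placeholders):

```python
# Sketch: encode every training image once, keep the latents, then free the VAE
# so it does not occupy VRAM during the training loop.
import torch

cached_latents = []
with torch.no_grad():
    for batch in train_dataloader:                    # yields {"pixel_values": ...}
        pixel_values = batch["pixel_values"].to(vae.device, dtype=vae.dtype)
        latents = vae.encode(pixel_values).latent_dist.sample()
        cached_latents.append(latents.cpu())

del vae                                               # VAE is no longer needed
torch.cuda.empty_cache()
```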

Comment on lines 123 to 124
**Colab Adaptability:**
<!-- [add a section talking about / change above to be focused on running in google colab] -->
Member:

Colab would be nice so that users feel it is easier to reproduce what you did.

It would also be nice to have a command to run the script directly with this specific dataset. Feel free to modify the train_dreambooth_lora_flux_miniature.py script to remove parts that are not needed, and upload the new script to the diffusers repo.

Member:

However, T4 Colab would be terribly slow.

Comment on lines 129 to 130

<!-- [Maybe add a link to trained LoRA adapter on Hugging Face Hub.] -->
Member:

It would be nice to create a collection of adapters that you trained using this script.

Comment on lines 108 to 113
base model:
![base model outputs](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/quantization-backends-diffusers2/alphonse_mucha_base_combined.png)

QLoRA fine-tuned:
![QLoRA model outputs](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/quantization-backends-diffusers2/alphonse_mucha_merged_combined.png)
*Prompts: (left to right)*
Member:

We need a section about inference after we train the LoRAs. Of course you can just load the LoRA back, but it would be nice to let the user know that they can also merge the LoRAs into the base model for efficient inference. cc @sayakpaul do people actually merge the LoRAs or not in practice?

Member:

> do people actually merge the LoRAs or not in practice?

It depends on use-cases and trade-offs. Sometimes people prefer merging to save VRAM, sometimes they don't to be able to experiment with different LoRAs.

Member:

Makes sense! Let's clarify this in the blog post.

@@ -0,0 +1,130 @@
# Fine-Tuning FLUX.1-dev with QLoRA
Member:

Maybe make it clear that users only need consumer hardware?

Fine-tuning FLUX.1-dev on consumer hardware with QLoRA

DerekLiu35 and others added 2 commits June 4, 2025 10:11
@merveenoyan (Contributor) left a comment

I left very general comments. Very nicely put blog!


## Conclusion

QLoRA, coupled with the `diffusers` library, significantly democratizes the ability to customize state-of-the-art models like FLUX.1-dev. As demonstrated on an RTX 4090, efficient fine-tuning is well within reach, yielding high-quality stylistic adaptations. Importantly, these techniques are adaptable, paving the way for users on more constrained hardware, like Google Colab, to also participate.
Contributor:

A call-to-action would be nice so that the blog converts to more models on the Hub 🙏🏻

In our previous post, [Exploring Quantization Backends in Diffusers](https://huggingface.co/blog/diffusers-quantization), we dived into how various quantization techniques can shrink diffusion models like FLUX.1-dev, making them significantly more accessible for *inference* without drastically compromising performance. We saw how `bitsandbytes`, `torchao`, and others reduce memory footprints for generating images.

Performing inference is cool but to make these models truly our own, we also need to be able to fine-tune them. Therefore, in this post, we tackle **efficient** *fine-tuning* of these models with peak memory use under ~10 GB of VRAM on a single GPU. This post will guide you through fine-tuning FLUX.1-dev using QLoRA with the Hugging Face `diffusers` library. We'll showcase results from an NVIDIA RTX 4090.

Contributor:

A ToC would be nice for those who know the basics and want to skip to the gist of it.

)
transformer.add_adapter(transformer_lora_config)
```
Only these LoRA parameters become trainable.
Contributor:

You could put here the nice printout that appears when you add the adapter, showing the number of trainable params vs. total params imo :)

Contributor (Author):

Added total parameters and trainable parameters, not sure if that's what you meant.

Contributor:

In PEFT, when you call print_trainable_parameters(), something like this appears:
trainable params: 667493 || all params: 86466149 || trainable%: 0.77

I find it pretty cool :)

Contributor (Author):

Ohh. This one isn’t a PEFT model though, it’s a diffusers model, so I had to use num_parameters() to get the total and trainable counts. Figured that’d give a similar overview!
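
For reference, a hedged sketch of that diffusers-side printout, using `num_parameters(only_trainable=True)` as the counterpart to PEFT's helper:

```python
# Sketch: reproduce a PEFT-style parameter summary for a diffusers model.
total_params = transformer.num_parameters()
trainable_params = transformer.num_parameters(only_trainable=True)
print(
    f"trainable params: {trainable_params:,} || all params: {total_params:,} "
    f"|| trainable%: {100 * trainable_params / total_params:.2f}"
)
```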

DerekLiu35 and others added 2 commits June 5, 2025 10:46
Co-authored-by: Merve Noyan <merveenoyan@gmail.com>
The model consists of three main components:

* **Text Encoders (CLIP and T5):**
  * **Function:** Process input text prompts. FLUX-dev uses CLIP for initial understanding and a larger T5 for nuanced comprehension and better text rendering.
Contributor:

No need to state "Function" IMHO, it seems a bit cluttered.




**Full Fine-Tuning:** This traditional method updates all model params and offers the potential for the highest task-specific quality. However, for FLUX.1-dev, this approach would demand immense VRAM (multiple high-end GPUs), putting it out of reach for most individual users.

**LoRA (Low-Rank Adaptation):** [LoRA](https://huggingface.co/docs/peft/main/en/conceptual_guides/lora) freezes the pre-trained weights and injects small, trainable "adapter" layers. This massively reduces trainable parameters, saving VRAM during training and resulting in small adapter checkpoints. The challenge is that the full-precision base model still needs to be loaded, which, for FLUX.1-dev, remains a hefty VRAM requirement even if fewer parameters are being updated.
Contributor:

Could be nice to reference previous blogs touching on this, e.g. https://huggingface.co/blog/lora


For a more detailed guide and code snippets, please refer to [this gist](https://gist.github.com/sayakpaul/f0358dd4f4bcedf14211eba5704df25a) and the [`diffusers-torchao` repository](https://github.com/sayakpaul/diffusers-torchao/tree/main/training).

## Inference with Trained LoRA Adapters
Contributor:

For the purpose of the blog I'd make the repo with the Alphonse Mucha LoRA public, with code examples for the 2 inference approaches. It would also be nice to use the Gallery component.
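
Something along these lines could work for the two approaches; the LoRA repo id is a placeholder until the Alphonse Mucha adapter repo is public, and the prompt/step count are illustrative:

```python
# Sketch: inference with a trained LoRA, either kept separate or fused into the base.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)

# Approach 1: load the adapter on top of the base model (easy to swap LoRAs).
pipe.load_lora_weights("your-username/alphonse-mucha-qlora-flux")  # placeholder repo id

# Approach 2 (alternative): merge the LoRA into the base weights for leaner inference.
pipe.fuse_lora()
pipe.unload_lora_weights()  # drop the separate adapter copy once fused

pipe.to("cuda")
image = pipe(
    "A serene woman with flowing hair, art nouveau style",
    num_inference_steps=28,
).images[0]
image.save("mucha_style.png")
```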

Comment on lines 294 to 296
If you've trained a LoRA for FLUX.1-dev, we encourage you to share it. Here's how you can do it:
- Follow this guide on [sharing models on the Hub](https://huggingface.co/docs/transformers/en/model_sharing).
- Add `flux` and `lora` as tags in your model card's metadata to make it easily discoverable.
Contributor:

I think having a code snippet here to push a LoRA to the Hub will make it easier for people to push their models, plus we can use the model card template in the Flux LoRA training script that already contains tags and such.
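
A hedged sketch of such a snippet using `huggingface_hub` (the repo id and output folder are placeholders; the model card from the training script's template could be uploaded alongside):

```python
# Sketch: push the saved LoRA adapter folder to the Hub.
from huggingface_hub import create_repo, upload_folder

repo_id = create_repo("your-username/alphonse-mucha-qlora-flux", exist_ok=True).repo_id
upload_folder(
    repo_id=repo_id,
    folder_path="output/alphonse-mucha-qlora",  # where the training script saved the adapter
    commit_message="Upload Alphonse Mucha QLoRA adapter",
)
```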
