
diffusers quantization training blog draft #2888


Open · wants to merge 13 commits into base: main

Conversation

DerekLiu35 (Contributor)

Preparing the Article

You're not quite done yet, though. Please make sure to follow this process (as documented here):

  • Add an entry to _blog.yml.
  • Add a thumbnail. There are no requirements here, but there is a template if it's helpful.
  • Check you use a short title and blog path.
  • Upload any additional assets (such as images) to the Documentation Images repo. This is to reduce bloat in the GitHub base repo when cloning and pulling. Try to have small images to avoid a slow or expensive user experience.
  • Add metadata (such as authors) to your md file. You can also specify guest or org for the authors.
  • Ensure the publication date is correct.
  • Preview the content. A quick way is to paste the markdown content in https://huggingface.co/new-blog. Do not click publish, this is just a way to do an early check.


@SunMarc SunMarc requested a review from sayakpaul June 4, 2025 11:12
@sayakpaul (Member) left a comment

Thanks for getting started on this one. The early draft looks really well structured.

Left some early comments. Let me know if they make sense.


Now, we tackle **efficiently *fine-tuning* these models.** This post will guide you through fine-tuning FLUX.1-dev using QLoRA with the Hugging Face `diffusers` library. We'll showcase results from an NVIDIA RTX 4090.

## Why Not Just Full Fine-Tuning?
Member:

Instead of making bulleted points, I think we could do it in short paragraphs to convey the main points.

Contributor:

I agree with this. Also, specifically here, I think it's intuitive why full fine-tuning isn't an option, so a dedicated title isn't necessary here imo.

* **Pros:** Massively reduces trainable parameters, saving VRAM during training and resulting in small adapter checkpoints.
* **Cons (for base model memory):** The full-precision base model still needs to be loaded, which, for FLUX.1-dev, is still a hefty VRAM requirement even if fewer parameters are being updated.

**QLoRA: The Efficiency Powerhouse:** QLoRA enhances LoRA by:
Member:

Consider providing the QLoRA paper link as a reference.

The fine-tuned model nicely captured Mucha's iconic art nouveau style, evident in the decorative motifs and distinct color palette. The QLoRA process maintained excellent fidelity while learning the new style.

**Colab Adaptability:**
<!-- [add a section talking about / change above to be focused on running in google colab] -->

@SunMarc (Member) left a comment

Really nice draft! Really promising! Left a bunch of comments.

Comment on lines 5 to 6
Now, we tackle **efficiently *fine-tuning* these models.** This post will guide you through fine-tuning FLUX.1-dev using QLoRA with the Hugging Face `diffusers` library. We'll showcase results from an NVIDIA RTX 4090.

Member:

Let's emphasize the VRAM that the user needs!

Comment on lines 15 to 21
**LoRA (Low-Rank Adaptation):** LoRA freezes the pre-trained weights and injects small, trainable "adapter" layers.
* **Pros:** Massively reduces trainable parameters, saving VRAM during training and resulting in small adapter checkpoints.
* **Cons (for base model memory):** The full-precision base model still needs to be loaded, which, for FLUX.1-dev, is still a hefty VRAM requirement even if fewer parameters are being updated.

**QLoRA: The Efficiency Powerhouse:** QLoRA enhances LoRA by:
1. Loading the pre-trained base model in a quantized format (typically 4-bit via `bitsandbytes`), drastically cutting the base model's memory footprint.
2. Training LoRA adapters (usually in FP16/BF16) on top of this quantized base.
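
To make the two-step recipe concrete, here is a minimal sketch of step 2, attaching LoRA adapters on top of the quantized base. It assumes the 4-bit `transformer` loaded in the snippet further down; the rank, alpha, and `target_modules` values are illustrative placeholders, not the draft's final settings.

```python
# Illustrative sketch only: attach trainable LoRA adapters to the frozen,
# 4-bit-quantized FLUX transformer. Hyperparameters here are placeholders.
from peft import LoraConfig

transformer_lora_config = LoraConfig(
    r=16,                    # low-rank dimension of the A/B matrices
    lora_alpha=16,           # scaling factor for the LoRA update
    init_lora_weights="gaussian",
    target_modules=["to_k", "to_q", "to_v", "to_out.0"],  # attention projections
)
transformer.add_adapter(transformer_lora_config)  # only these adapters receive gradients
```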
Member:

Let's add some links to relevant docs if they want more information on LoRA and QLoRA. Adding images of how LoRA and QLoRA work would also be nice.
This one is quite nice: https://huggingface.co/docs/peft/main/en/developer_guides/lora#merge-lora-weights-into-the-base-model

Comment on lines 27 to 28
We aimed to fine-tune `black-forest-labs/FLUX.1-dev` to adopt the artistic style of Alphonse Mucha, using a small [dataset](https://huggingface.co/datasets/derekl35/alphonse-mucha-style).
<!-- (maybe use different dataset) -->
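
For readers who want to look at the data first, a quick hedged sketch of loading that dataset with 🤗 Datasets (the split name should be checked against the dataset card):

```python
# Sketch: inspect the Alphonse Mucha style dataset referenced above.
from datasets import load_dataset

dataset = load_dataset("derekl35/alphonse-mucha-style", split="train")
print(dataset)            # row count and column names
print(dataset[0].keys())  # fields of a single example
```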
Member:

If we get more datasets, we can just publish a collection with multiple LoRAs.

Member:

Could include the LoRA I used in https://github.com/huggingface/diffusers/tree/main/examples/research_projects/flux_lora_quantization.

Maybe @linoytsaban can share a few more interesting datasets.


Comment on lines 36 to 43
**LoRA (Low-Rank Adaptation) Deep Dive:**
LoRA works by decomposing weight updates into low-rank matrices. Instead of updating the full weight matrix $$W$$, LoRA learns two smaller matrices $$A$$ and $$B$$ such that the update is $$\Delta W = BA$$, where $$A \in \mathbb{R}^{r \times k}$$ and $$B \in \mathbb{R}^{d \times r}$$. The rank $$r$$ is typically much smaller than the original dimensions, drastically reducing trainable parameters. LoRA $$\alpha$$ is a scaling factor for the LoRA activations, often set to the same value as $$r$$ or a multiple of it. It helps balance the influence of the pre-trained model and the LoRA adapter.

**8-bit Optimizer (AdamW):**
The standard AdamW optimizer maintains first and second moment estimates for each parameter in FP32, consuming significant memory. The 8-bit AdamW optimizer uses block-wise quantization to store optimizer states in 8-bit precision while maintaining training stability. This technique can reduce optimizer memory usage by ~75% compared to standard FP32 AdamW.

**Gradient Checkpointing:**
During forward pass, intermediate activations are typically stored for backward pass gradient computation. Gradient checkpointing trades computation for memory by only storing certain "checkpoint" activations and recomputing others during backpropagation.
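
A hedged sketch of how these two savings are typically wired up with `bitsandbytes` and `diffusers`; it assumes the LoRA-equipped `transformer` from the other snippets, and the learning rate is a placeholder:

```python
# Sketch: 8-bit AdamW for the LoRA parameters + gradient checkpointing.
import bitsandbytes as bnb

trainable_params = [p for p in transformer.parameters() if p.requires_grad]

# Optimizer states are kept in 8-bit via block-wise quantization (~75% smaller).
optimizer = bnb.optim.AdamW8bit(trainable_params, lr=1e-4)

# Store only checkpointed activations and recompute the rest in the backward pass.
transformer.enable_gradient_checkpointing()
```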
Member:

Didn't see that you added this here. Maybe add the image here + add links.

Comment on lines +51 to +70
```python
import torch
from diffusers import BitsAndBytesConfig, FluxTransformer2DModel

# Determine compute dtype based on mixed precision
bnb_4bit_compute_dtype = torch.float32
if args.mixed_precision == "fp16":
    bnb_4bit_compute_dtype = torch.float16
elif args.mixed_precision == "bf16":
    bnb_4bit_compute_dtype = torch.bfloat16

# NF4 4-bit quantization config for the base transformer
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=bnb_4bit_compute_dtype,
)

# Load the FLUX transformer in 4-bit NF4
transformer = FluxTransformer2DModel.from_pretrained(
    args.pretrained_model_name_or_path,
    subfolder="transformer",
    quantization_config=nf4_config,
    torch_dtype=bnb_4bit_compute_dtype,
)
```
Member:

Since the model weights are in bfloat16, let's do the training in that specific dtype, no? cc @sayakpaul

Member:

Yeah. bfloat16 is preferred. On Colab (T4), FP16 is needed as BF16 makes it cry.

Comment on lines 44 to 46

<!-- maybe explain cache latents -->

Member:

Would be nice indeed!
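
For what that explanation could look like, a hedged sketch of latent caching (simplified; the actual script also applies the VAE's scaling and shift factors, and the dataloader/variable names here are placeholders):

```python
# Sketch: encode every training image once, keep the latents, then free the VAE
# so it does not occupy VRAM during the training loop.
import torch

cached_latents = []
with torch.no_grad():
    for batch in train_dataloader:                    # yields {"pixel_values": ...}
        pixel_values = batch["pixel_values"].to(vae.device, dtype=vae.dtype)
        latents = vae.encode(pixel_values).latent_dist.sample()
        cached_latents.append(latents.cpu())

del vae                                               # VAE is no longer needed
torch.cuda.empty_cache()
```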

Comment on lines 123 to 124
**Colab Adaptability:**
<!-- [add a section talking about / change above to be focused on running in google colab] -->
Member:

Colab would be nice so that users feel it is easier to reproduce what you did.

It would also be nice to have a command to run the script directly with this specific dataset. Feel free to modify the train_dreambooth_lora_flux_miniature.py script to remove parts that are not needed, and upload the new script to the diffusers repo.

Member:

However, T4 Colab would be terribly slow.

Comment on lines 129 to 130

<!-- [Maybe add a link to trained LoRA adapter on Hugging Face Hub.] -->
Member:

It would be nice to create a collection of adapters that you trained using this script.

Comment on lines 108 to 113
base model:
![base model outputs](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/quantization-backends-diffusers2/alphonse_mucha_base_combined.png)

QLoRA fine-tuned:
![QLoRA model outputs](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/quantization-backends-diffusers2/alphonse_mucha_merged_combined.png)
*Prompts: (left to right)*
Member:

We need a section about inference after we train the LoRAs. Of course you can just load the LoRA back, but it would be nice to let the user know that they can also merge the LoRAs into the base model for efficient inference. cc @sayakpaul do people actually merge the LoRAs or not in practice?

Member:

> do people actually merge the LoRAs or not in practice?

It depends on use-cases and trade-offs. Sometimes people prefer merging to save VRAM, sometimes they don't to be able to experiment with different LoRAs.

Member:

Makes sense! Let's clarify this in the blog post.

@@ -0,0 +1,130 @@
# Fine-Tuning FLUX.1-dev with QLoRA
Member:

Maybe make it clear that users only need consumer hardware?

Fine-tuning FLUX.1-dev on consumer hardware with QLoRA

DerekLiu35 and others added 2 commits June 4, 2025 10:11
@merveenoyan (Contributor) left a comment

I left very general comments. Very nicely put blog!


## Conclusion

QLoRA, coupled with the `diffusers` library, significantly democratizes the ability to customize state-of-the-art models like FLUX.1-dev. As demonstrated on an RTX 4090, efficient fine-tuning is well within reach, yielding high-quality stylistic adaptations. Importantly, these techniques are adaptable, paving the way for users on more constrained hardware, like Google Colab, to also participate.
Contributor:

A call-to-action would be nice so that the blog converts to more models on the Hub 🙏🏻

In our previous post, [Exploring Quantization Backends in Diffusers](https://huggingface.co/blog/diffusers-quantization), we dived into how various quantization techniques can shrink diffusion models like FLUX.1-dev, making them significantly more accessible for *inference* without drastically compromising performance. We saw how `bitsandbytes`, `torchao`, and others reduce memory footprints for generating images.

Performing inference is cool but to make these models truly our own, we also need to be able to fine-tune them. Therefore, in this post, we tackle **efficient** *fine-tuning* of these models with peak memory use under ~10 GB of VRAM on a single GPU. This post will guide you through fine-tuning FLUX.1-dev using QLoRA with the Hugging Face `diffusers` library. We'll showcase results from an NVIDIA RTX 4090.

Contributor:

A ToC would be nice for those who know the basics and want to skip to the gist of it.

)
transformer.add_adapter(transformer_lora_config)
```
Only these LoRA parameters become trainable.
Contributor:

You could put here the nice printout that appears when you add the adapter, showing the number of trainable params vs. total params imo :)

Contributor (Author):

Added total parameters and trainable parameters, not sure if that's what you meant.

Contributor:

In PEFT, when you call print_trainable_parameters(), something like this appears:
trainable params: 667493 || all params: 86466149 || trainable%: 0.77

I find it pretty cool :)

Contributor (Author):

Ohh. This one isn’t a PEFT model though, it’s a diffusers model, so I had to use num_parameters() to get the total and trainable counts. Figured that’d give a similar overview!
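
For reference, a hedged sketch of that diffusers-side printout, using `num_parameters(only_trainable=True)` as the counterpart to PEFT's helper:

```python
# Sketch: reproduce a PEFT-style parameter summary for a diffusers model.
total_params = transformer.num_parameters()
trainable_params = transformer.num_parameters(only_trainable=True)
print(
    f"trainable params: {trainable_params:,} || all params: {total_params:,} "
    f"|| trainable%: {100 * trainable_params / total_params:.2f}"
)
```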

DerekLiu35 and others added 2 commits June 5, 2025 10:46
Co-authored-by: Merve Noyan <merveenoyan@gmail.com>
The model consists of three main components:

* **Text Encoders (CLIP and T5):**
  * **Function:** Process input text prompts. FLUX-dev uses CLIP for initial understanding and a larger T5 for nuanced comprehension and better text rendering.
Contributor:

No need to state "Function" IMHO, it seems a bit cluttered.




**Full Fine-Tuning:** This traditional method updates all model params and offers the potential for the highest task-specific quality. However, for FLUX.1-dev, this approach would demand immense VRAM (multiple high-end GPUs), putting it out of reach for most individual users.

**LoRA (Low-Rank Adaptation):** [LoRA](https://huggingface.co/docs/peft/main/en/conceptual_guides/lora) freezes the pre-trained weights and injects small, trainable "adapter" layers. This massively reduces trainable parameters, saving VRAM during training and resulting in small adapter checkpoints. The challenge is that the full-precision base model still needs to be loaded, which, for FLUX.1-dev, remains a hefty VRAM requirement even if fewer parameters are being updated.
Contributor:

Could be nice to reference previous blogs touching on this, e.g. https://huggingface.co/blog/lora


For a more detailed guide and code snippets, please refer to [this gist](https://gist.github.com/sayakpaul/f0358dd4f4bcedf14211eba5704df25a) and the [`diffusers-torchao` repository](https://github.com/sayakpaul/diffusers-torchao/tree/main/training).

## Inference with Trained LoRA Adapters
Contributor:

For the purpose of the blog I'd make the repo with the Alphonse Mucha LoRA public, with code examples for the 2 inference approaches. It would also be nice to use the Gallery component.
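
Something along these lines could work for the two approaches; the LoRA repo id is a placeholder until the Alphonse Mucha adapter repo is public, and the prompt/step count are illustrative:

```python
# Sketch: inference with a trained LoRA, either kept separate or fused into the base.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)

# Approach 1: load the adapter on top of the base model (easy to swap LoRAs).
pipe.load_lora_weights("your-username/alphonse-mucha-qlora-flux")  # placeholder repo id

# Approach 2 (alternative): merge the LoRA into the base weights for leaner inference.
pipe.fuse_lora()
pipe.unload_lora_weights()  # drop the separate adapter copy once fused

pipe.to("cuda")
image = pipe(
    "A serene woman with flowing hair, art nouveau style",
    num_inference_steps=28,
).images[0]
image.save("mucha_style.png")
```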

Comment on lines 294 to 296
If you've trained a LoRA for FLUX.1-dev, we encourage you to share it. Here's how you can do it:
- Follow this guide on [sharing models on the Hub](https://huggingface.co/docs/transformers/en/model_sharing).
- Add `flux` and `lora` as tags in your model card's metadata to make it easily discoverable.
Contributor:

I think having a code snippet here to push a LoRA to the Hub will make it easier for people to push their models, plus we can use the model card template in the Flux LoRA training script that already contains tags and such.
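
A hedged sketch of such a snippet using `huggingface_hub` (the repo id and output folder are placeholders; the model card from the training script's template could be uploaded alongside):

```python
# Sketch: push the saved LoRA adapter folder to the Hub.
from huggingface_hub import create_repo, upload_folder

repo_id = create_repo("your-username/alphonse-mucha-qlora-flux", exist_ok=True).repo_id
upload_folder(
    repo_id=repo_id,
    folder_path="output/alphonse-mucha-qlora",  # where the training script saved the adapter
    commit_message="Upload Alphonse Mucha QLoRA adapter",
)
```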
