
Releases: a-r-r-o-w/finetrainers

v0.2.0

25 Apr 13:23

Finetrainers v0.2.0 🧪

New trainers

  • Channel-concatenated control conditioning for Wan2.1 and CogView4
      ◦ Wan image-conditioning on the T2V model (demo: wan-t2v-image-conditioning.mp4)
      ◦ CogView4 control conditioning (Edit + Canny)

The training adds extra input channels to the patch embedding layer (referred to as the "control injection" layer in finetrainers) to mix conditioning features into the latent stream. This architecture choice is very common and has been used in many models before: CogVideoX-I2V, HunyuanVideo-I2V, Alibaba's Fun Control models, etc. Because it is both popular and simple, it is a good fit to support as a standalone trainer.
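
The snippet below shows how this is used at inference time with the released CogView4 edit LoRA: the patch-embedding projection is expanded with zero-initialized weights, the LoRA is loaded, and the control latents are concatenated with the denoising latents along the channel dimension.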

import torch
from diffusers import CogView4Pipeline
from diffusers.utils import load_image
from finetrainers.models.utils import _expand_linear_with_zeroed_weights
from finetrainers.patches import load_lora_weights
from finetrainers.patches.dependencies.diffusers.control import control_channel_concat

dtype = torch.bfloat16
device = torch.device("cuda")
generator = torch.Generator().manual_seed(0)

pipe = CogView4Pipeline.from_pretrained("THUDM/CogView4-6B", torch_dtype=dtype)

in_channels = pipe.transformer.config.in_channels
patch_channels = pipe.transformer.patch_embed.proj.in_features
# Double the input features of the patch-embedding projection so it can take the
# channel-concatenated control latents; the new weights are zero-initialized, so
# the base model's behavior is unchanged before the LoRA is applied.
pipe.transformer.patch_embed.proj = _expand_linear_with_zeroed_weights(pipe.transformer.patch_embed.proj, new_in_features=2 * patch_channels)

load_lora_weights(pipe, "finetrainers/CogView4-6B-Edit-LoRA-v0", "cogview4-lora")
pipe.set_adapters("cogview4-lora", 0.9)
pipe.to(device)

prompt = "Make the image look like it's from an ancient Egyptian mural."
control_image = load_image("examples/training/control/cogview4/omni_edit/validation_dataset/0.png")
height, width = 1024, 1024

with torch.no_grad():
    # Prepare the initial noise latents for generation.
    latents = pipe.prepare_latents(1, in_channels, height, width, dtype, device, generator)
    # Encode the control image into latents with the VAE and normalize them.
    control_image = pipe.image_processor.preprocess(control_image, height=height, width=width)
    control_image = control_image.to(device=device, dtype=dtype)
    control_latents = pipe.vae.encode(control_image).latent_dist.sample(generator=generator)
    control_latents = (control_latents - pipe.vae.config.shift_factor) * pipe.vae.config.scaling_factor

# Concatenate the control latents onto the hidden_states input along the channel dimension for every transformer forward pass.
with control_channel_concat(pipe.transformer, ["hidden_states"], [control_latents], dims=[1]):
    image = pipe(prompt, latents=latents, num_inference_steps=30, generator=generator).images[0]

image.save("output.png")
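
For intuition, here is a minimal sketch of what "expanding a linear layer with zeroed weights" means. This is a hypothetical re-implementation for illustration only; the actual _expand_linear_with_zeroed_weights helper in finetrainers may differ in its details.

import torch
import torch.nn as nn

def expand_linear_zero_init(linear: nn.Linear, new_in_features: int) -> nn.Linear:
    # Create a wider projection, copy the pretrained weights into the original
    # columns, and zero the columns corresponding to the new control channels.
    expanded = nn.Linear(
        new_in_features,
        linear.out_features,
        bias=linear.bias is not None,
        dtype=linear.weight.dtype,
        device=linear.weight.device,
    )
    with torch.no_grad():
        expanded.weight.zero_()
        expanded.weight[:, : linear.in_features] = linear.weight
        if linear.bias is not None:
            expanded.bias.copy_(linear.bias)
    # At initialization the extra (control) channels contribute nothing, so the
    # expanded layer reproduces the base model's outputs exactly.
    return expanded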

New models supported

  • FLUX.1-dev
  • Wan2.1 I2V

Find example training configs here.

Attention

Support for multiple attention providers for training and inference: PyTorch native, flash-attn, sageattention, xformers, and flex attention. See the docs for more details.
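
As a rough illustration of the PyTorch-native case only (this is plain PyTorch, not the finetrainers provider API; see the docs for how providers are configured in finetrainers), restricting scaled dot-product attention to a specific backend looks like this:

import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# Illustration only (requires torch >= 2.3 and a CUDA device).
device, dtype = "cuda", torch.bfloat16
q = torch.randn(1, 8, 1024, 64, device=device, dtype=dtype)
k = torch.randn(1, 8, 1024, 64, device=device, dtype=dtype)
v = torch.randn(1, 8, 1024, 64, device=device, dtype=dtype)

# Force the flash-attention kernel for scaled_dot_product_attention inside this block.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v)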

Other major changes

  • Better regional compilation support (see the sketch below)
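
Regional compilation applies torch.compile to each repeated transformer block rather than to the whole model, so the compiled artifact is reused across identical blocks and cold-start compilation time drops. A minimal sketch of the idea, assuming a diffusers transformer that exposes its blocks as a transformer_blocks ModuleList (finetrainers' exact integration may differ):

import torch
from diffusers import CogView4Pipeline

pipe = CogView4Pipeline.from_pretrained("THUDM/CogView4-6B", torch_dtype=torch.bfloat16).to("cuda")

# Compile each repeated block separately instead of wrapping the whole transformer.
blocks = pipe.transformer.transformer_blocks
for i in range(len(blocks)):
    blocks[i] = torch.compile(blocks[i])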

What's Changed

New Contributors

Full Changelog: v0.1.0...v0.2.0

v0.1.0

21 Mar 11:26

Finetrainers v0.1.0 🧪

output.mp4

New models supported

  • CogView4
  • Wan 2.1

Checkpoints released

Other major changes

  • Full support for Accelerate as a parallelization backend again
  • Opt-in precomputation, helpful for smaller datasets
  • Better remote dataset loading support
  • Support for datasets>=3.4.0
  • Bug fix: Layerwise Casting now works with Wan (see the sketch below)
  • Bug fix: LTX Video training now works with batch_size > 1
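
For context on the Layerwise Casting fix: layerwise casting is a diffusers memory optimization that stores weights in a low-precision dtype and upcasts them layer by layer to the compute dtype during the forward pass. A minimal sketch, assuming a recent diffusers release that exposes enable_layerwise_casting (the model id below is illustrative):

import torch
from diffusers import WanPipeline

pipe = WanPipeline.from_pretrained("Wan-AI/Wan2.1-T2V-1.3B-Diffusers", torch_dtype=torch.bfloat16)
# Store transformer weights in float8 and upcast each layer to bfloat16 on the fly.
pipe.transformer.enable_layerwise_casting(
    storage_dtype=torch.float8_e4m3fn, compute_dtype=torch.bfloat16
)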

What's Changed

New Contributors

Full Changelog: v0.0.1...v0.1.0

v0.0.1

24 Feb 10:15

FineTrainers v0.0.1 🧪

FineTrainers is a work-in-progress library to support (accessible) training of diffusion models. The following models are currently supported (based on Diffusers):

  • CogVideoX T2V (versions 1.0 and 1.5)
  • LTX Video
  • Hunyuan Video

The legacy/deprecated scripts also support CogVideoX I2V and Mochi.

Currently, LoRA and full-rank finetuning are supported. With time, more models and training techniques will be supported. We thank our many contributors for their amazing work on improving finetrainers. They are mentioned below in the "New Contributors" section.

In a short timespan, finetrainers has found its way into multiple research works, which has been a very motivating factor for us. They are mentioned in the "Featured Projects" section of the README. We hope you find them interesting and continue to build & work on interesting ideas, while sharing your research artifacts openly!

Some artifacts that we've released ourselves are available here: https://huggingface.co/finetrainers

We plan to focus on core algorithms/models that users prefer to have support for quickly, primarily based on the feedback we've received (thanks to everyone who's spoken with me regarding this; your time is invaluable!). The major asks are:

  • more models and faster support for newer models (we will open this up for contributions after a major open but pending PR, and add many ourselves!)
  • compatibility with UIs that do not support standardized implementations from diffusers (we will write two-way conversion scripts for new models that are added to diffusers, so that it is easy to obtain original-format weights from diffusers-format weights)
  • more algorithms (Control LoRA, ControlNets for video models, and VideoJAM are some of the most requested techniques -- we will prioritize these!)
  • Dataset QoL changes (this is a WIP in an open but pending PR)

Let us know what you'd like to see next & stay tuned for interesting updates!

output.mp4

What's Changed
