
3D Parallel + Model Spec API #245

Merged: 36 commits into main from model-spec on Mar 3, 2025

Conversation

a-r-r-o-w (Owner) commented Jan 25, 2025

Model Specification

TODO

Parallel Backends

Finetrainers supports parallel training across multiple GPUs and nodes. This is done using the PyTorch DTensor backend.

As an experiment for comparing the performance of different training backends, I've implemented multi-backend support. These backends may or may not fully rely on PyTorch's distributed DTensor solution. Currently, only 🤗 Accelerate is supported, for backwards-compatibility reasons (we initially started with Accelerate support only). In the near future, there are plans to integrate natively with other training backends.

Native support for context parallelism and pipeline parallelism, based on PyTorch DTensor and custom solutions inspired by ParaAttention and xDiT, is also planned. Note that this is just for experimental purposes, to satisfy my curiosity about the performance of different frameworks. Users should only expect stable support with Accelerate and PyTorch DTensor.

Support matrix of configurations in this PR that have been verified to work:

| Backend    | DDP | FSDP1 | FSDP2 | HSDP | TP | PP | CP |
|------------|-----|-------|-------|------|----|----|----|
| PTD        | 🤗  | 😡    | 🤗    | 🤗   | 🤗 | 😡 | 😡 |
| Accelerate | 🤗  | 😡    | 😡    | 😡   | 😡 | 😡 | 😡 |

Check the docs for more information.
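
For orientation, here is a minimal plain-PyTorch sketch of the kind of 2D device mesh (data parallel × tensor parallel) that the DTensor-based backend builds on. This is illustrative standalone PyTorch (>= 2.3) usage, not finetrainers code, and assumes 8 GPUs launched via torchrun:

```python
# Illustrative sketch of a 2D (DP x TP) device mesh with PyTorch DTensor APIs.
# Not finetrainers internals. Launch with: torchrun --nproc_per_node=8 mesh_sketch.py
import os
import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import ColwiseParallel, RowwiseParallel, parallelize_module

torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# 8 GPUs arranged as 4-way data parallel x 2-way tensor parallel.
mesh = init_device_mesh("cuda", (4, 2), mesh_dim_names=("dp", "tp"))

model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).cuda()

# Shard the two linear layers across the "tp" sub-mesh; the "dp" dimension
# would then be used for replication or FSDP-style sharding.
parallelize_module(model, mesh["tp"], {"0": ColwiseParallel(), "2": RowwiseParallel()})
```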

Training improvements

  • More information is logged to wandb for better tracking of model performance
  • Users can now modify targets and sigmas in the model specification's forward method for more control over what they are optimizing for (see the sketch after this list)
  • Metrics can be logged at a chosen interval instead of at every step
  • (postponed, will be supported in diffusers directly) Flash attention
  • (postponed, due to poor results and NCCL hangs): Native H100 fp8 training using TorchAO
  • (postponed, due to requiring debugging to make general purpose for all models): Pipeline parallel
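
To make the targets/sigmas point above concrete, here is a small standalone sketch of the kind of logic one could place in the model specification's forward method. The function name and tensor shapes are illustrative, not the finetrainers API:

```python
# Standalone sketch of customizable target/sigma computation for a
# flow-matching-style loss. Names are illustrative, not the finetrainers API.
import torch

def compute_flow_matching_target(latents, noise, sigmas):
    """Noise the latents and return (noisy_latents, target) for the loss."""
    # Broadcast per-sample sigmas over the latent dimensions.
    while sigmas.dim() < latents.dim():
        sigmas = sigmas.unsqueeze(-1)
    noisy_latents = (1.0 - sigmas) * latents + sigmas * noise
    target = noise - latents  # velocity-style target; customizable per model
    return noisy_latents, target

latents = torch.randn(2, 16, 8, 32, 32)  # (batch, channels, frames, height, width)
noise = torch.randn_like(latents)
sigmas = torch.rand(2)                   # one noise level per sample
noisy, target = compute_flow_matching_target(latents, noise, sigmas)
```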

Precomputation

A new mechanism for preprocessing and batched training has been implemented so that medium-to-large-scale datasets can be handled efficiently without using too much disk space. The DistributedDataProcessor is fed the dataloader iterators, processes a fixed batch of --precomputation_items, and saves them to --precomputation_dir. These items can then be used for batched training based on frame-height-width bucket collation. By default, 512 items are precomputed, but this should be adjusted by users based on available disk space and the scale of training.
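
Conceptually, the precomputation flow looks roughly like the following standalone sketch (illustrative only; the actual DistributedDataProcessor has its own interface and handles distribution across ranks):

```python
# Conceptual sketch of the precomputation flow. Illustrative only; not the
# DistributedDataProcessor API from finetrainers.
import os
import torch

def precompute(dataloader_iter, encode_fn, num_items, output_dir):
    """Encode up to `num_items` samples and cache them to `output_dir`."""
    os.makedirs(output_dir, exist_ok=True)
    for index in range(num_items):
        try:
            sample = next(dataloader_iter)
        except StopIteration:
            break
        with torch.no_grad():
            item = encode_fn(sample)  # e.g. VAE latents + text embeddings
        torch.save(item, os.path.join(output_dir, f"item_{index:06d}.pt"))

# Training then loads the cached items and collates them into
# frame-height-width buckets instead of re-encoding the data every step.
```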

Processors

This PR introduces a ProcessorMixin. It is an attempt to provide a standard interface for building graph-like dataset manipulation pipelines, from source data to the input data expected by the condition/latent models. A processor should ideally have very simple functionality and do one thing only, so that multiple processors can be composed together. Processors should be invoked from the ModelSpecification::prepare_latents and ModelSpecification::prepare_conditions methods. Their use is opt-in: users are not required to use them and can apply any custom preprocessing logic instead.

TODO: show an example
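
As a stand-in for the example above, here is a minimal hypothetical sketch of two small, composable processors; the real ProcessorMixin interface may differ:

```python
# Hypothetical sketch only: the actual ProcessorMixin interface may differ.
# It illustrates "small, composable processors" invoked from
# prepare_latents / prepare_conditions.
import torch

class NormalizeProcessor:
    """Maps uint8 pixels in [0, 255] to floats in [-1, 1]."""
    def __call__(self, video: torch.Tensor) -> torch.Tensor:
        return video.float() / 127.5 - 1.0

class ResizeProcessor:
    """Resizes video frames to a fixed spatial size (one small responsibility)."""
    def __init__(self, height: int, width: int):
        self.height = height
        self.width = width

    def __call__(self, video: torch.Tensor) -> torch.Tensor:
        # video: (frames, channels, height, width)
        return torch.nn.functional.interpolate(
            video, size=(self.height, self.width), mode="bilinear", align_corners=False
        )

# Processors compose by simple chaining inside prepare_latents/prepare_conditions.
processors = [NormalizeProcessor(), ResizeProcessor(480, 720)]
video = torch.randint(0, 256, (16, 3, 512, 768), dtype=torch.uint8)
for processor in processors:
    video = processor(video)
```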

Environment

Finetrainers has only been tested with the following environment (output obtained by running diffusers-cli env):

- 🤗 Diffusers version: 0.33.0.dev0
- Platform: Linux-5.4.0-166-generic-x86_64-with-glibc2.31
- Running on Google Colab?: No
- Python version: 3.10.14
- PyTorch version (GPU?): 2.5.1+cu124 (True)
- Flax version (CPU?/GPU?/TPU?): 0.8.5 (cpu)
- Jax version: 0.4.31
- JaxLib version: 0.4.31
- Huggingface_hub version: 0.28.1
- Transformers version: 4.48.0.dev0
- Accelerate version: 1.1.0.dev0
- PEFT version: 0.14.1.dev0
- Bitsandbytes version: 0.43.3
- Safetensors version: 0.4.5
- xFormers version: not installed
- Accelerator: NVIDIA A100-SXM4-80GB, 81920 MiB
               NVIDIA A100-SXM4-80GB, 81920 MiB
               NVIDIA A100-SXM4-80GB, 81920 MiB
               NVIDIA DGX Display, 4096 MiB
               NVIDIA A100-SXM4-80GB, 81920 MiB

Other changes

  • Adds support for LTX first-frame conditioning
  • Allows usage of arbitrary text encoders, tokenizers, and transformers outside of pretrained_model_name_or_path
  • Allows more flexibility in the calculation of predictions and targets, so that we don't have to work around models like Mochi

a-r-r-o-w (Owner, Author) commented:

Actually, there's no urgent reason to merge this soon. I'll get DDP/FSDP, PP and TP working together at least, since if we're going to be flaky for a bit, it's best to have all features available for testing and work my way through the bugs.

@a-r-r-o-w a-r-r-o-w changed the title Model Specification API 4D Parallel + Model Spec API Jan 31, 2025
@a-r-r-o-w a-r-r-o-w marked this pull request as ready for review March 3, 2025 10:27
@a-r-r-o-w a-r-r-o-w changed the title 4D Parallel + Model Spec API 3D Parallel + Model Spec API Mar 3, 2025
@a-r-r-o-w a-r-r-o-w merged commit 9bb9aff into main Mar 3, 2025
1 check passed
@a-r-r-o-w a-r-r-o-w deleted the model-spec branch March 3, 2025 10:57