3D Parallel + Model Spec API (#245)
* model specification

* update

* update

* attempt ltx image condition

* update

* use image-to-video pipeline for ltx if image is provided

* revert to dp replica working state

* add debug note

* make activation checkpointing work with fsdp

* update

* update

* add accelerate backend

* some qol improvements

* update

* update

* refactor processors; distributed data processor; refactor user-facing api; ltx model-spec tests

* update tests

* update

* remove dead code

* update

* update docs

* handle checkpointing

* update docs

* fix fqn mapping error due to creating optimizers before activation checkpointing/parallelizing

* custom arg parsing for trainer configs; doc improvements; gradient accumulation

* improve dataset handling

* update

* improve dataset precomputation

* add ltxvideo pika crush example

* add back some changes

* update

* fix

* fix test

* update tests
a-r-r-o-w authored Mar 3, 2025
1 parent c4d6c6c commit 9bb9aff
Showing 140 changed files with 7,943 additions and 4,137 deletions.
3 changes: 1 addition & 2 deletions .gitignore
@@ -168,10 +168,9 @@ cython_debug/
wandb/
*.txt
dump*
*dummy*
outputs*
*.slurm
.vscode/
*.json
*dummy*

!requirements.txt
41 changes: 41 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,41 @@
# How to contribute to Finetrainers

Finetrainers is an early-stage library for training diffusion models. Everyone is welcome to contribute - models, algorithms, refactors, docs, etc. - but due to the early stage of the project, we recommend bigger contributions be discussed in an issue before submitting a PR. Eventually, we will have a better process for this!

## How to contribute

### Adding a new model

If you would like to add a new model, please follow these steps:

- Create a new file in the `finetrainers/models` directory with the model name (if it's new), or use the same directory if it's a variant of an existing model.
- Implement the model specification in the file. For more details on what a model specification should look like, see the [ModelSpecification](TODO(aryan): add link) documentation.
- Update the supported configs in `finetrainers/config.py` to include the new model and the training types supported.
- Add a dummy model specification in the `tests/models` directory.
- Make sure to test training with the following settings:
- Single GPU
- 2x GPU with `--dp_degree 2 --dp_shards 1`
- 2x GPU with `--dp_degree 1 --dp_shards 2`

For `SFTTrainer` additions, please make sure to train for at least 1000 steps (at least 2000 data points) to ensure the model training works as expected; a minimal launch sketch for the 2x GPU settings is shown after this list.
- Open a PR with your changes. Please make sure to share your wandb logs for the above training settings in the PR description. This will help us verify the training is working as expected.
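
As a rough illustration, the 2x GPU replica test above might be launched as follows. This is a hedged sketch, not a prescribed command: `--dp_degree`/`--dp_shards` come from the checklist above, `--model_name my_new_model` is a hypothetical placeholder, and the remaining flags depend on your model and dataset.

```bash
# Sketch of a 2x GPU test run (--dp_degree 2 --dp_shards 1); the model name
# and flag subset are illustrative placeholders, not a complete command.
accelerate launch --config_file accelerate_configs/uncompiled_2.yaml train.py \
  --model_name my_new_model \
  --training_type lora \
  --dp_degree 2 --dp_shards 1
```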

### Adding a new algorithm

Currently, we are not accepting algorithm contributions. We will update this section once we are better prepared 🤗

### Refactors

The library is in a very early stage. There are many instances of dead code, poorly written abstractions, and other issues. If you would like to refactor/clean-up a part of the codebase, please open an issue to discuss the changes before submitting a PR.

### Dataset improvements

Any changes to dataset/dataloader implementations can be submitted directly. Please convey the improvements and the reasons behind the changes clearly, so that we can move quickly 🤗

### Documentation

Due to the early stage of the project, the documentation is not as comprehensive as we would like. Any improvements/refactors are welcome directly!

## Asking for help

If you have any questions, feel free to open an issue and we will be sure to help you out as soon as possible! Please make sure to describe your issue in either English (preferred) or Chinese. Issues in any other language are hard for us to help with, and we will most likely close them without an explanation/answer.
4 changes: 2 additions & 2 deletions Makefile
@@ -1,11 +1,11 @@
.PHONY: quality style

check_dirs := finetrainers tests examples
check_dirs := finetrainers tests examples train.py

quality:
ruff check $(check_dirs) --exclude examples/_legacy
ruff format --check $(check_dirs) --exclude examples/_legacy

style:
ruff check $(check_dirs) --fix --exclude examples/_legacy
ruff format $(check_dirs) --exclude examples/_legacy
ruff format $(check_dirs) --exclude examples/_legacy
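
Usage is unchanged by this diff; a typical local workflow with the targets above:

```bash
make style    # ruff check --fix + ruff format over finetrainers, tests, examples, train.py
make quality  # check-only variant of the same
```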
132 changes: 25 additions & 107 deletions README.md
@@ -2,9 +2,7 @@

FineTrainers is a work-in-progress library to support (accessible) training of video models. Our first priority is to support LoRA training for all popular video models in [Diffusers](https://github.com/huggingface/diffusers), and eventually other methods like controlnets, control-loras, distillation, etc.

> [!NOTE]
>
> `cogvideox-factory` was renamed to `finetrainers`. If you're looking to train CogVideoX or Mochi with the legacy training scripts, please refer to [this](./examples/_legacy/) README instead.
`cogvideox-factory` was renamed to `finetrainers`. If you're looking to train CogVideoX or Mochi with the legacy training scripts, please refer to [this](./training/README.md) README instead. Everything in the `training/` directory will eventually be moved and supported under `finetrainers`.

<table align="center">
<tr>
@@ -18,128 +16,48 @@ FineTrainers is a work-in-progress library to support (accessible) training of v
- 🔥 **2025-02-12**: Check out [eisneim/ltx_lora_training_i2v_t2v](https://github.com/eisneim/ltx_lora_training_i2v_t2v/)! It builds off of `finetrainers` to support image to video training for LTX-Video and STG guidance for inference.
- 🔥 **2025-01-15**: Support for naive FP8 weight-casting training added! This allows training HunyuanVideo in under 24 GB up to specific resolutions.
- 🔥 **2025-01-13**: Support for T2V full-finetuning added! Thanks to [@ArEnSc](https://github.com/ArEnSc) for taking up the initiative!
- 🔥 **2025-01-03**: Support for T2V LoRA finetuning of [CogVideoX](https://huggingface.co/docs/diffusers/main/api/pipelines/cogvideox) added!
- 🔥 **2025-01-03**: Support for T2V LoRA finetuning of [CogVideoX](https://huggingface.co/docs/diffusers/main/api/pipelines/cogvideox) added!
- 🔥 **2024-12-20**: Support for T2V LoRA finetuning of [Hunyuan Video](https://huggingface.co/docs/diffusers/main/api/pipelines/hunyuan_video) added! We would like to thank @SHYuanBest for his work on a training script [here](https://github.com/huggingface/diffusers/pull/10254).
- 🔥 **2024-12-18**: Support for T2V LoRA finetuning of [LTX Video](https://huggingface.co/docs/diffusers/main/api/pipelines/ltx_video) added!

## Table of Contents

* [Quickstart](#quickstart)
* [Support Matrix](#support-matrix)
* [Acknowledgements](#acknowledgements)
- [Quickstart](#quickstart)
- [Support Matrix](#support-matrix)
- [Featured Projects](#featured-projects)
- [Acknowledgements](#acknowledgements)

## Quickstart

Clone the repository and make sure the requirements are installed: `pip install -r requirements.txt` and install `diffusers` from source by `pip install git+https://github.com/huggingface/diffusers`. The requirements specify `diffusers>=0.32.1`, but it is always recommended to use the `main` branch for the latest features and bugfixes.
Clone the repository and install the requirements with `pip install -r requirements.txt`, then install `diffusers` from source with `pip install git+https://github.com/huggingface/diffusers`. The requirements specify `diffusers>=0.32.1`, but it is always recommended to use the `main` branch of Diffusers for the latest features and bugfixes. Note that the `main` branch of `finetrainers` is also the development branch; stable support should be expected from the release tags.
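
As a minimal sketch of the setup steps above (assuming the upstream repository URL):

```bash
git clone https://github.com/a-r-r-o-w/finetrainers
cd finetrainers
pip install -r requirements.txt
pip install git+https://github.com/huggingface/diffusers
```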

Then download a dataset:
Check out the latest release tag:

```bash
# install `huggingface_hub`
huggingface-cli download \
--repo-type dataset Wild-Heart/Disney-VideoGeneration-Dataset \
--local-dir video-dataset-disney
git fetch --all --tags
git checkout tags/v0.0.1
```

Then launch LoRA fine-tuning. Below we provide an example for LTX-Video. We refer users to [`docs/training`](./docs/training/) for more details.

> [!IMPORTANT]
> It is recommended to use PyTorch 2.5.1 or above for training. Previous versions can lead to completely black videos, OOM errors, or other issues, and are not tested.
<details>
<summary>Training command</summary>

TODO: LTX does not do too well with the disney dataset. We will update this to use a better example soon.
Follow the instructions mentioned in the [README](https://github.com/a-r-r-o-w/finetrainers/tree/v0.0.1) for the release tag.

```bash
#!/bin/bash
export WANDB_MODE="offline"
export NCCL_P2P_DISABLE=1
export TORCH_NCCL_ENABLE_MONITORING=0
export FINETRAINERS_LOG_LEVEL=DEBUG

GPU_IDS="0,1"

DATA_ROOT="/path/to/video-dataset-disney"
CAPTION_COLUMN="prompt.txt"
VIDEO_COLUMN="videos.txt"
OUTPUT_DIR="/path/to/output/directory/ltx-video/ltxv_disney"

ID_TOKEN="BW_STYLE"

# Model arguments
model_cmd="--model_name ltx_video \
--pretrained_model_name_or_path Lightricks/LTX-Video"

# Dataset arguments
dataset_cmd="--data_root $DATA_ROOT \
--video_column $VIDEO_COLUMN \
--caption_column $CAPTION_COLUMN \
--id_token $ID_TOKEN \
--video_resolution_buckets 49x512x768 \
--caption_dropout_p 0.05"

# Dataloader arguments
dataloader_cmd="--dataloader_num_workers 0"

# Diffusion arguments
diffusion_cmd="--flow_weighting_scheme logit_normal"

# Training arguments
training_cmd="--training_type lora \
--seed 42 \
--batch_size 1 \
--train_steps 3000 \
--rank 128 \
--lora_alpha 128 \
--target_modules to_q to_k to_v to_out.0 \
--gradient_accumulation_steps 4 \
--gradient_checkpointing \
--checkpointing_steps 500 \
--checkpointing_limit 2 \
--enable_slicing \
--enable_tiling"

# Optimizer arguments
optimizer_cmd="--optimizer adamw \
--lr 3e-5 \
--lr_scheduler constant_with_warmup \
--lr_warmup_steps 100 \
--lr_num_cycles 1 \
--beta1 0.9 \
--beta2 0.95 \
--weight_decay 1e-4 \
--epsilon 1e-8 \
--max_grad_norm 1.0"

# Miscellaneous arguments
miscellaneous_cmd="--tracker_name finetrainers-ltxv \
--output_dir $OUTPUT_DIR \
--nccl_timeout 1800 \
--report_to wandb"

cmd="accelerate launch --config_file accelerate_configs/uncompiled_2.yaml --gpu_ids $GPU_IDS train.py \
$model_cmd \
$dataset_cmd \
$dataloader_cmd \
$diffusion_cmd \
$training_cmd \
$optimizer_cmd \
$miscellaneous_cmd"

echo "Running command: $cmd"
eval $cmd
echo -ne "-------------------- Finished executing script --------------------\n\n"
```
To get started quickly with example training scripts on the main development branch, refer to the following:
- [LTX-Video Pika Effects Crush](./examples/training/ltx_video/)

</details>
The following are some simple datasets, and HF orgs hosting good datasets, to quickly test training with:
- [Disney Video Generation Dataset](https://huggingface.co/datasets/Wild-Heart/Disney-VideoGeneration-Dataset)
- [bigdatapw Video Dataset Collection](https://huggingface.co/bigdata-pw)
- [Finetrainers HF Dataset Collection](https://huggingface.co/finetrainers)
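
For example, the Disney dataset can be downloaded locally with `huggingface-cli` (this command is taken from the earlier version of this README):

```bash
huggingface-cli download \
  --repo-type dataset Wild-Heart/Disney-VideoGeneration-Dataset \
  --local-dir video-dataset-disney
```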

Here we are using two GPUs. But one can do single-GPU training by setting `GPU_IDS=0`. By default, we are using some simple optimizations to reduce memory consumption (such as gradient checkpointing). Please refer to [docs/training/optimizations](./docs/training/optimization.md) to learn about the memory optimizations currently supported.
Please check out [`docs/models`](./docs/models/) and [`examples/training`](./examples/training/) to learn more about the supported models for training, along with example reproducible training launch scripts.

For inference, refer [here](./docs/training/ltx_video.md#inference). For docs related to the other supported models, refer [here](./docs/training/).
> [!IMPORTANT]
> It is recommended to use PyTorch 2.5.1 or above for training. Previous versions can lead to completely black videos, OOM errors, or other issues, and are not tested.
## Support Matrix

> [!NOTE]
> The following numbers were obtained from the [release branch](https://github.com/a-r-r-o-w/finetrainers/tree/v0.0.1). The `main` branch is unstable at the moment and may use higher memory.
<div align="center">

| **Model Name** | **Tasks** | **Min. LoRA VRAM<sup>*</sup>** | **Min. Full Finetuning VRAM<sup>^</sup>** |
@@ -169,5 +87,5 @@ Checkout the following UIs built for `finetrainers`:

## Acknowledgements

* `finetrainers` builds on top of a body of great open-source libraries: `transformers`, `accelerate`, `peft`, `diffusers`, `bitsandbytes`, `torchao`, `deepspeed` -- to name a few.
* Some of the design choices were inspired by [`SimpleTuner`](https://github.com/bghira/SimpleTuner).
* `finetrainers` builds on top of & takes inspiration from great open-source libraries - `transformers`, `accelerate`, `torchtune`, `torchtitan`, `peft`, `diffusers`, `bitsandbytes`, `torchao` and `deepspeed` - to name a few.
* Some of the design choices of `finetrainers` were inspired by [`SimpleTuner`](https://github.com/bghira/SimpleTuner).
17 changes: 17 additions & 0 deletions accelerate_configs/uncompiled_4.yaml
@@ -0,0 +1,17 @@
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
enable_cpu_affinity: false
gpu_ids: 0,1,2,3
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
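
As a usage sketch, this 4-GPU config plugs into the same launch pattern as the repository's existing 2-GPU example; the `train.py` flags below are only an illustrative model-selection subset, not a complete training command:

```bash
accelerate launch --config_file accelerate_configs/uncompiled_4.yaml train.py \
  --model_name ltx_video \
  --pretrained_model_name_or_path Lightricks/LTX-Video \
  --training_type lora
```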
20 changes: 20 additions & 0 deletions docs/_NOTES_FOR_FUTURE_ME.md
@@ -0,0 +1,20 @@
# Notes for Future Me

> [!NOTE]
> This doc page is intended for developers and contributors.

FSDP dump:
- https://pytorch.org/docs/stable/notes/fsdp.html#fsdp-notes
- https://github.com/pytorch/pytorch/issues/114299
- Using FSDP1 requires that all FSDP flat parameters have the same dtype. For LoRA training, we default LoRA parameters to fp32 and transformer parameters to the dtype chosen by the user. There seems to be no easier workaround than performing LoRA training in the same dtype (see the sketch after this list).
- https://github.com/pytorch/pytorch/issues/100945
- https://github.com/pytorch/torchtune/blob/9b3836028fd0b48f593ea43474b86880c49a4d74/recipes/lora_finetune_distributed.py
- https://github.com/KellerJordan/modded-nanogpt/pull/68
- https://github.com/pytorch/pytorch/pull/125394: monkey-patch method for FSDP pre/post-hooks to be triggered for methods other than `forward`
- https://github.com/pytorch/pytorch/pull/127786:
- https://github.com/pytorch/pytorch/pull/130949:
- Sanity saver: create optimizers after parallelizing/activation-checkpointing models
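
A minimal Python sketch of two of the notes above (the FSDP1 dtype constraint and the optimizer-ordering sanity saver); the helper names here are hypothetical, not the library's API:

```python
import torch

def cast_lora_params(model: torch.nn.Module, dtype: torch.dtype) -> None:
    # FSDP1 flat parameters must share one dtype, so cast LoRA weights to the
    # transformer's training dtype before wrapping (hypothetical helper).
    for name, param in model.named_parameters():
        if "lora_" in name:  # PEFT-style LoRA parameter naming
            param.data = param.data.to(dtype)

# Ordering sanity saver: parallelize/apply activation checkpointing first,
# then create the optimizer, so optimizer-state FQNs match the wrapped names.
# cast_lora_params(transformer, torch.bfloat16)
# transformer = shard_and_checkpoint(transformer)  # hypothetical helper
# optimizer = torch.optim.AdamW(transformer.parameters(), lr=3e-5)
```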

DTensor:
- https://github.com/pytorch/pytorch/issues/88838
- https://github.com/pytorch/pytorch/blob/main/test/distributed/tensor/parallel/test_parallelize_api.py