
3D Parallel + Model Spec API #245

Merged: 36 commits into main from model-spec on Mar 3, 2025

Conversation

a-r-r-o-w (Owner) commented Jan 25, 2025

Model Specification

TODO

Parallel Backends

Finetrainers supports parallel training across multiple GPUs and nodes. This is done using the PyTorch DTensor backend.

As an experiment for comparing the performance of different training backends, I've implemented multi-backend support. These backends may or may not fully rely on PyTorch's distributed DTensor solution. Currently, only 🤗 Accelerate is supported, for backwards-compatibility reasons (we initially started with Accelerate support only). In the near future, there are plans to integrate natively with other training backends.

Native support for context parallelism and pipeline parallelism, based on PyTorch DTensor and custom solutions inspired by ParaAttention and xDiT, is also planned. Note that this is just for experimental purposes, to satisfy my curiosity about the performance of different frameworks. Users should only expect stable support with Accelerate and PyTorch DTensor.

Support matrix of configurations in this PR that have been verified to work:

| Backend    | DDP | FSDP1 | FSDP2 | HSDP | TP | PP | CP |
|------------|-----|-------|-------|------|----|----|----|
| PTD        | 🤗  | 😡    | 🤗    | 🤗   | 🤗 | 😡 | 😡 |
| Accelerate | 🤗  | 😡    | 😡    | 😡   | 😡 | 😡 | 😡 |

Check the docs for more information.
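
For orientation, here is a minimal plain-PyTorch sketch of the kind of 2D device mesh (data parallel × tensor parallel) that the DTensor-based backend builds on. This is illustrative standalone PyTorch (>= 2.3) usage, not finetrainers code, and assumes 8 GPUs launched via torchrun:

```python
# Illustrative sketch of a 2D (DP x TP) device mesh with PyTorch DTensor APIs.
# Not finetrainers internals. Launch with: torchrun --nproc_per_node=8 mesh_sketch.py
import os
import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import ColwiseParallel, RowwiseParallel, parallelize_module

torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# 8 GPUs arranged as 4-way data parallel x 2-way tensor parallel.
mesh = init_device_mesh("cuda", (4, 2), mesh_dim_names=("dp", "tp"))

model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).cuda()

# Shard the two linear layers across the "tp" sub-mesh; the "dp" dimension
# would then be used for replication or FSDP-style sharding.
parallelize_module(model, mesh["tp"], {"0": ColwiseParallel(), "2": RowwiseParallel()})
```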

Training improvements

  • More information is logged to wandb for better tracking of model performance
  • Users can now modify targets and sigmas in the model specification's forward method for more control over what they are optimizing for (see the sketch after this list)
  • Metrics can be logged at a chosen interval instead of at every step
  • (postponed, will be supported in diffusers directly) Flash attention
  • (postponed, due to poor results and NCCL hangs): Native H100 fp8 training using TorchAO
  • (postponed, due to requiring debugging to make general purpose for all models): Pipeline parallel
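
To make the targets/sigmas point above concrete, here is a small standalone sketch of the kind of logic one could place in the model specification's forward method. The function name and tensor shapes are illustrative, not the finetrainers API:

```python
# Standalone sketch of customizable target/sigma computation for a
# flow-matching-style loss. Names are illustrative, not the finetrainers API.
import torch

def compute_flow_matching_target(latents, noise, sigmas):
    """Noise the latents and return (noisy_latents, target) for the loss."""
    # Broadcast per-sample sigmas over the latent dimensions.
    while sigmas.dim() < latents.dim():
        sigmas = sigmas.unsqueeze(-1)
    noisy_latents = (1.0 - sigmas) * latents + sigmas * noise
    target = noise - latents  # velocity-style target; customizable per model
    return noisy_latents, target

latents = torch.randn(2, 16, 8, 32, 32)  # (batch, channels, frames, height, width)
noise = torch.randn_like(latents)
sigmas = torch.rand(2)                   # one noise level per sample
noisy, target = compute_flow_matching_target(latents, noise, sigmas)
```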

Precomputation

A new mechanism for preprocessing and batched training has been implemented so that medium-to-large-scale datasets can be handled efficiently without using too much disk space. The DistributedDataProcessor is fed the dataloader iterators, processes a fixed batch of --precomputation_items, and saves them to --precomputation_dir. These items can then be used for batched training based on frame-height-width bucket collation. By default, 512 items are precomputed, but this should be adjusted by users based on available disk space and the scale of training.
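
Conceptually, the precomputation flow looks roughly like the following standalone sketch (illustrative only; the actual DistributedDataProcessor has its own interface and handles distribution across ranks):

```python
# Conceptual sketch of the precomputation flow. Illustrative only; not the
# DistributedDataProcessor API from finetrainers.
import os
import torch

def precompute(dataloader_iter, encode_fn, num_items, output_dir):
    """Encode up to `num_items` samples and cache them to `output_dir`."""
    os.makedirs(output_dir, exist_ok=True)
    for index in range(num_items):
        try:
            sample = next(dataloader_iter)
        except StopIteration:
            break
        with torch.no_grad():
            item = encode_fn(sample)  # e.g. VAE latents + text embeddings
        torch.save(item, os.path.join(output_dir, f"item_{index:06d}.pt"))

# Training then loads the cached items and collates them into
# frame-height-width buckets instead of re-encoding the data every step.
```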

Processors

This PR introduces a ProcessorMixin. It is an attempt to provide a standard interface for building graph-like dataset manipulation pipelines, from source data to the input data expected by the condition/latent models. A processor should ideally have very simple functionality and do one thing only, so that multiple processors can be composed together. Processors should be invoked from the ModelSpecification::prepare_latents and ModelSpecification::prepare_conditions methods. Their use is opt-in: users are not required to use them and can apply any custom preprocessing logic instead.

TODO: show an example
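
As a stand-in for the example above, here is a minimal hypothetical sketch of two small, composable processors; the real ProcessorMixin interface may differ:

```python
# Hypothetical sketch only: the actual ProcessorMixin interface may differ.
# It illustrates "small, composable processors" invoked from
# prepare_latents / prepare_conditions.
import torch

class NormalizeProcessor:
    """Maps uint8 pixels in [0, 255] to floats in [-1, 1]."""
    def __call__(self, video: torch.Tensor) -> torch.Tensor:
        return video.float() / 127.5 - 1.0

class ResizeProcessor:
    """Resizes video frames to a fixed spatial size (one small responsibility)."""
    def __init__(self, height: int, width: int):
        self.height = height
        self.width = width

    def __call__(self, video: torch.Tensor) -> torch.Tensor:
        # video: (frames, channels, height, width)
        return torch.nn.functional.interpolate(
            video, size=(self.height, self.width), mode="bilinear", align_corners=False
        )

# Processors compose by simple chaining inside prepare_latents/prepare_conditions.
processors = [NormalizeProcessor(), ResizeProcessor(480, 720)]
video = torch.randint(0, 256, (16, 3, 512, 768), dtype=torch.uint8)
for processor in processors:
    video = processor(video)
```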

Environment

Finetrainers has only been tested with the following environment (output obtained by running diffusers-cli env):

- 🤗 Diffusers version: 0.33.0.dev0
- Platform: Linux-5.4.0-166-generic-x86_64-with-glibc2.31
- Running on Google Colab?: No
- Python version: 3.10.14
- PyTorch version (GPU?): 2.5.1+cu124 (True)
- Flax version (CPU?/GPU?/TPU?): 0.8.5 (cpu)
- Jax version: 0.4.31
- JaxLib version: 0.4.31
- Huggingface_hub version: 0.28.1
- Transformers version: 4.48.0.dev0
- Accelerate version: 1.1.0.dev0
- PEFT version: 0.14.1.dev0
- Bitsandbytes version: 0.43.3
- Safetensors version: 0.4.5
- xFormers version: not installed
- Accelerator: NVIDIA A100-SXM4-80GB, 81920 MiB
               NVIDIA A100-SXM4-80GB, 81920 MiB
               NVIDIA A100-SXM4-80GB, 81920 MiB
               NVIDIA DGX Display, 4096 MiB
               NVIDIA A100-SXM4-80GB, 81920 MiB

Other changes

  • Adds support for LTX first-frame conditioning
  • Allows usage of arbitrary text encoders, tokenizers, and transformers outside of pretrained_model_name_or_path
  • Allows more flexibility in the calculation of predictions and targets, so that we don't have to work around models like Mochi

a-r-r-o-w (Owner, Author) commented:

Actually, there's no urgent reason to merge this soon. I'll get DDP/FSDP, PP and TP working together at least, since if we're going to be flaky for a bit, it's best to have all features available for testing and work my way through the bugs.

@a-r-r-o-w a-r-r-o-w changed the title Model Specification API 4D Parallel + Model Spec API Jan 31, 2025
@a-r-r-o-w a-r-r-o-w marked this pull request as ready for review March 3, 2025 10:27
@a-r-r-o-w a-r-r-o-w changed the title 4D Parallel + Model Spec API 3D Parallel + Model Spec API Mar 3, 2025
@a-r-r-o-w a-r-r-o-w merged commit 9bb9aff into main Mar 3, 2025
1 check passed
@a-r-r-o-w a-r-r-o-w deleted the model-spec branch March 3, 2025 10:57