Commit message:

* model specification
* update
* update
* attempt ltx image condition
* update
* use image-to-video pipeline for ltx if image is provided
* revert to dp replica working state
* add debug note
* make activation checkpointing work with fsdp
* update
* update
* add accelerate backend
* some qol improvements
* update
* update
* refactor processors; distributed data processor; refactor user-facing api; ltx model-spec tests
* update tests
* update
* remove dead code
* update
* update docs
* handle checkpointing
* update docs
* fix fqn mapping error due to creating optimizers before activation checkpointing/parallelizing
* custom arg parsing for trainer configs; doc improvements; gradient accumulation
* improve dataset handling
* update
* improve dataset precomputation
* add ltxvideo pika crush example
* add back some changes
* update
* fix
* fix test
* update tests
Showing 140 changed files with 7,943 additions and 4,137 deletions.
```diff
@@ -168,10 +168,9 @@ cython_debug/
 wandb/
 *.txt
 dump*
 *dummy*
 outputs*
 *.slurm
 .vscode/
 *.json
-*dummy*
 !requirements.txt
```
# How to contribute to Finetrainers

Finetrainers is an early-stage library for training diffusion models. Everyone is welcome to contribute - models, algorithms, refactors, docs, etc. - but due to the early stage of the project, we recommend that bigger contributions be discussed in an issue before submitting a PR. Eventually, we will have a better process for this!

## How to contribute

### Adding a new model

If you would like to add a new model, please follow these steps:

- Create a new file in the `finetrainers/models` directory with the model name (if it's new), or use the same directory if it's a variant of an existing model.
- Implement the model specification in the file. For more details on what a model specification should look like, see the [ModelSpecification](TODO(aryan): add link) documentation.
- Update the supported configs in `finetrainers/config.py` to include the new model and the training types it supports.
- Add a dummy model specification in the `tests/models` directory.
- Make sure to test training with the following settings:
  - Single GPU
  - 2x GPU with `--dp_degree 2 --dp_shards 1`
  - 2x GPU with `--dp_degree 1 --dp_shards 2`

  For `SFTTrainer` additions, please make sure to train for at least 1000 steps (at least 2000 data points) to ensure the model training works as expected.
- Open a PR with your changes. Please share your wandb logs for the above training settings in the PR description; this will help us verify that training works as expected.
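To make the shape of a model specification concrete, here is a minimal, purely illustrative sketch. This is NOT the actual finetrainers `ModelSpecification` API - the class name, fields, and methods below are hypothetical stand-ins chosen for illustration; consult the real documentation linked above for the true interface.

```python
# Illustrative sketch only -- NOT the real finetrainers API. All names here
# (MyModelSpecification, pretrained_model_id, supported_training_types, ...)
# are hypothetical stand-ins for whatever ModelSpecification actually defines.
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class MyModelSpecification:
    # Identifier of the pretrained checkpoint this spec wraps (hypothetical field).
    pretrained_model_id: str
    # Training types this model supports, mirroring the configs that would be
    # registered in finetrainers/config.py (names here are assumptions).
    supported_training_types: List[str] = field(
        default_factory=lambda: ["lora", "full-finetune"]
    )
    # Hooks for loading submodels; a real spec would return actual modules.
    component_loaders: Dict[str, Callable] = field(default_factory=dict)

    def validate(self) -> None:
        # A spec should fail fast on obviously bad configuration.
        if not self.pretrained_model_id:
            raise ValueError("pretrained_model_id must be set")


spec = MyModelSpecification(pretrained_model_id="my-org/my-model")
spec.validate()
print(spec.supported_training_types)  # ['lora', 'full-finetune']
```

The dummy spec added under `tests/models` would follow the same shape, but with tiny randomly initialized components so that the single-GPU and 2x-GPU test runs stay cheap.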
### Adding a new algorithm

Currently, we are not accepting algorithm contributions. We will update this section once we are ready 🤗

### Refactors

The library is in a very early stage. There are many instances of dead code, poorly written abstractions, and other issues. If you would like to refactor or clean up a part of the codebase, please open an issue to discuss the changes before submitting a PR.

### Dataset improvements

Any changes to dataset/dataloader implementations can be submitted directly. Please convey the improvements and the reasons for the changes clearly so we can move quickly 🤗

### Documentation

Due to the early stage of the project, the documentation is not as comprehensive as we would like. Any improvements or refactors are welcome directly!

## Asking for help

If you have any questions, feel free to open an issue and we will be sure to help you out as soon as possible. Please describe your issue in either English (preferable) or Chinese; any other language will make it hard for us to help you, so we will most likely close such issues without an answer.
```diff
@@ -1,11 +1,11 @@
 .PHONY: quality style

-check_dirs := finetrainers tests examples
+check_dirs := finetrainers tests examples train.py

 quality:
 	ruff check $(check_dirs) --exclude examples/_legacy
 	ruff format --check $(check_dirs) --exclude examples/_legacy

 style:
 	ruff check $(check_dirs) --fix --exclude examples/_legacy
 	ruff format $(check_dirs) --exclude examples/_legacy
```
```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
enable_cpu_affinity: false
gpu_ids: 0,1,2,3
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
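One invariant worth checking in a config like the one above: on a single machine, `num_processes` should match the number of entries in `gpu_ids`. A small stdlib-only sanity check (the config is embedded as a string here purely for illustration, to avoid a PyYAML dependency):

```python
# Sanity-check an accelerate-style config: num_processes should equal the
# GPU count times the machine count. Stdlib-only sketch with a minimal
# hand-rolled key:value parser (not a full YAML parser).
config_text = """\
gpu_ids: 0,1,2,3
num_machines: 1
num_processes: 4
mixed_precision: bf16
"""

config = {}
for line in config_text.splitlines():
    key, _, value = line.partition(":")
    config[key.strip()] = value.strip()

gpu_ids = config["gpu_ids"].split(",")
num_processes = int(config["num_processes"])
num_machines = int(config["num_machines"])

# With one machine and four GPU ids, four processes is consistent.
assert num_processes == len(gpu_ids) * num_machines
print(f"OK: {num_processes} processes across {len(gpu_ids)} GPUs")
```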
# Notes for Future Me

> [!NOTE]
> This doc page is intended for developers and contributors.

FSDP dump:
- https://pytorch.org/docs/stable/notes/fsdp.html#fsdp-notes
- https://github.com/pytorch/pytorch/issues/114299
- Using FSDP1 requires that all FSDP flat parameters are of the same dtype. For LoRA training, we default LoRA parameters to fp32 and transformer parameters to the dtype chosen by the user. There seems to be no easier workaround than performing LoRA training in the same dtype.
- https://github.com/pytorch/pytorch/issues/100945
- https://github.com/pytorch/torchtune/blob/9b3836028fd0b48f593ea43474b86880c49a4d74/recipes/lora_finetune_distributed.py
- https://github.com/KellerJordan/modded-nanogpt/pull/68
- https://github.com/pytorch/pytorch/pull/125394: monkey-patch method for FSDP pre/post-hooks to be triggered for methods other than `forward`
- https://github.com/pytorch/pytorch/pull/127786
- https://github.com/pytorch/pytorch/pull/130949
- Sanity saver: create optimizers after parallelizing/activation-checkpointing models
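The "sanity saver" above can be illustrated without any framework: wrapping a model (FSDP sharding, activation checkpointing, etc.) can replace its parameter objects, so an optimizer created beforehand keeps updating stale references. This is a framework-free sketch with toy stand-in classes, not PyTorch code:

```python
# Framework-free sketch of why optimizers must be created AFTER parallelizing.
# `Param`, `Model`, `parallelize`, and `Optimizer` are toy stand-ins: the real
# issue involves FSDP flattening/sharding replacing nn.Parameter objects.
class Param:
    def __init__(self, value):
        self.value = value

class Model:
    def __init__(self):
        self.params = [Param(1.0), Param(2.0)]

def parallelize(model):
    # Stand-in for FSDP wrapping: parameters are re-created (flattened/sharded),
    # so the old Param objects are no longer part of the model.
    model.params = [Param(p.value) for p in model.params]

class Optimizer:
    def __init__(self, params):
        self.params = list(params)  # holds references captured at creation time
    def step(self):
        for p in self.params:
            p.value -= 0.1

model = Model()
bad_opt = Optimizer(model.params)   # WRONG ORDER: created before parallelize
parallelize(model)
bad_opt.step()
print(model.params[0].value)        # 1.0 -- the live params never moved

model = Model()
parallelize(model)
good_opt = Optimizer(model.params)  # RIGHT ORDER: created after parallelize
good_opt.step()
print(model.params[0].value)        # 0.9 -- updates reach the live params
```

In the real trainer, the symptom of the wrong order was the fqn-mapping error mentioned in the commit message, fixed by moving optimizer creation after activation checkpointing/parallelization.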
DTensor:
- https://github.com/pytorch/pytorch/issues/88838
- https://github.com/pytorch/pytorch/blob/main/test/distributed/tensor/parallel/test_parallelize_api.py