Fine-tune error for this version #728

Closed
4 tasks done
wangaocheng opened this issue Jan 17, 2025 · 23 comments
Labels
bug Something isn't working

@wangaocheng

Checks

  • This template is only for bug reports, usage problems go with 'Help Wanted'.
  • I have thoroughly reviewed the project documentation but couldn't find information to solve my problem.
  • I have searched for existing issues, including closed ones, and couldn't find a solution.
  • I confirm that I am using English to submit this report in order to facilitate communication.

Environment Details

Docker

Steps to Reproduce

Startup successful, but unable to train.

(f5-tts) root@abd144ff4f98:/github/F5-TTS# f5-tts_finetune-gradio --host 0.0.0.0
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.402 seconds.
Prefix dict has been built successfully.
Word segmentation module jieba initialized.

Starting app...

To create a public link, set share=True in launch().
run command :
accelerate launch /github/F5-TTS/src/f5_tts/train/finetune_cli.py --exp_name F5TTS_Base --learning_rate 1e-05 --batch_size_per_gpu 1600 --batch_size_type frame --max_samples 64 --grad_accumulation_steps 1 --max_grad_norm 1 --epochs 5000 --num_warmup_updates 72 --save_per_updates 5000 --keep_last_n_checkpoints 5 --last_per_updates 18 --dataset_name wac --finetune --tokenizer pinyin --log_samples --logger wandb

The following values were not passed to accelerate launch and had defaults used instead:
--num_processes was set to a value of 2
More than one GPU was found, enabling multi-GPU training.
If this was unintended please pass in --num_processes=1.
--num_machines was set to a value of 1
--mixed_precision was set to a value of 'no'
--dynamo_backend was set to a value of 'no'
To avoid this warning pass in values for each of the problematic parameters or run accelerate config.
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Word segmentation module jieba initialized.

Word segmentation module jieba initialized.

Loading model cost 0.415 seconds.
Prefix dict has been built successfully.
Loading model cost 0.417 seconds.
Prefix dict has been built successfully.

vocab : 3651

vocoder : vocos

vocab : 3651

vocoder : vocos
wandb: Currently logged in as: aochengwang. Use wandb login --relogin to force relogin
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: - Waiting for wandb.init()...
wandb: \ Waiting for wandb.init()...
wandb: | Waiting for wandb.init()...
wandb: Tracking run with wandb version 0.19.2
wandb: Run data is saved locally in /github/F5-TTS/wandb/run-20250117_154459-s03chtpd
wandb: Run wandb offline to turn off syncing.
wandb: Syncing run F5TTS_Base
wandb: ⭐️ View project at https://wandb.ai/aochengwang/wac
wandb: 🚀 View run at https://wandb.ai/aochengwang/wac/runs/s03chtpd
Using logger: wandb
Loading dataset ...
Loading dataset ...
Download Vocos from huggingface charactr/vocos-mel-24khz
Download Vocos from huggingface charactr/vocos-mel-24khz

Sorting with sampler... if slow, check whether dataset is provided with duration: 0%| | 0/709 [00:00<?, ?it/s]
Sorting with sampler... if slow, check whether dataset is provided with duration: 100%|██████████| 709/709 [00:00<00:00, 1855122.61it/s]

Creating dynamic batches with 1600 audio frames per gpu: 0%| | 0/709 [00:00<?, ?it/s]
Creating dynamic batches with 1600 audio frames per gpu: 100%|██████████| 709/709 [00:00<00:00, 2779216.39it/s]

Sorting with sampler... if slow, check whether dataset is provided with duration: 0%| | 0/709 [00:00<?, ?it/s]
Sorting with sampler... if slow, check whether dataset is provided with duration: 100%|██████████| 709/709 [00:00<00:00, 1870290.27it/s]

Creating dynamic batches with 1600 audio frames per gpu: 0%| | 0/709 [00:00<?, ?it/s]
Creating dynamic batches with 1600 audio frames per gpu: 100%|██████████| 709/709 [00:00<00:00, 2650411.35it/s]

Epoch 1/5000: 0%| | 0/76 [00:00<?, ?update/s][rank0]:[W reducer.cpp:1389] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[rank1]:[W reducer.cpp:1389] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())

Epoch 1/5000: 1%|▏ | 1/76 [00:02<02:55, 2.34s/update]

✔️ Expected Behavior

No response

❌ Actual Behavior

No response

@wangaocheng wangaocheng added the bug Something isn't working label Jan 17, 2025
@ZhikangNiu
Collaborator

So what is the question and bug?

@P-uck

P-uck commented Jan 24, 2025

@ZhikangNiu

Hi, I think the problem is that on multi-GPU setups the finetune process freezes. I have the same problem in #715 as well.

@kdcyberdude

Hi @ZhikangNiu @SWivid @wangaocheng,
I'm encountering the same issue where training freezes on a multi-GPU setup, specifically with multiple RTX 4090s.

I've also tried disabling P2P communication (NCCL_IB_DISABLE=1 and NCCL_P2P_DISABLE=1) without success:

NCCL_IB_DISABLE=1 NCCL_P2P_DISABLE=1 accelerate launch --config_file ./accelerate_config_4-4090.yaml /wdc/proj/F5-TTS/src/f5_tts/train/finetune_cli.py \
  --exp_name F5TTS_Base \
  --learning_rate 7.5e-5 \
  --batch_size_per_gpu 9600 \
  --batch_size_type frame \
  --max_samples 64 \
  --grad_accumulation_steps 8 \
  --max_grad_norm 1 \
  --epochs 1865 \
  --num_warmup_updates 20000 \
  --save_per_updates 15000 \
  --last_per_updates 30000 \
  --dataset_name raag_tasariv \
  --finetune \
  --pretrain /wdc/proj/F5-TTS/ckpts/raag_tasariv_custom/model_1250000.pt \
  --tokenizer_path /wdc/proj/F5-TTS/data/raag_tasariv_custom/vocab.txt \
  --tokenizer custom \
  --log_samples

When using accelerate launch for other projects (e.g., parler-tts, w2v2-bert), I don't need to enable distributed training explicitly; specifying DeepSpeed configurations and gpu_ids during accelerate config works fine.
In this case, enabling distributed training utilizes all GPUs, but they seem unable to communicate properly, similar to this issue.
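
For what it's worth, a generic diagnostic (standard NCCL environment variables, nothing project-specific) is to enable NCCL's own debug logging before launching, e.g.:

NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,COLL \
  accelerate launch --config_file ./accelerate_config_4-4090.yaml /wdc/proj/F5-TTS/src/f5_tts/train/finetune_cli.py ...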

Error Logs

Epoch 1/1865: 0%| | 1/1052 [00:02<41:01, 2.34s/update, loss=8.04, update=1]
[rank1]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=61, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600081 milliseconds before timing out.
[rank3]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=61, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600085 milliseconds before timing out.
[rank2]:[E ProcessGroupNCCL.cpp:563] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=61, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600084 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:563] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63, OpType=ALLREDUCE, NumelIn=2202724, NumelOut=2202724, Timeout(ms)=600000) ran for 600000 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 1] Timeout at NCCL work: 61, last enqueued NCCL work: 61, last completed NCCL work: 60.
[rank1]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=61, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600081 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7038404a6897 in /home/kd/anaconda3/envs/f5-tts/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7037f37aa1b2 in /home/kd/anaconda3/envs/f5-tts/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7037f37aefd0 in /home/kd/anaconda3/envs/f5-tts/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7037f37b031c in /home/kd/anaconda3/envs/f5-tts/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xeabb4 (0x703841aeabb4 in /home/kd/anaconda3/envs/f5-tts/bin/../lib/libstdc++.so.6)
frame #5: + 0x9ca94 (0x70384329ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x129c3c (0x703843329c3c in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 0 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=61, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600081 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7038404a6897 in /home/kd/anaconda3/envs/f5-tts/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7037f37aa1b2 in /home/kd/anaconda3/envs/f5-tts/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7037f37aefd0 in /home/kd/anaconda3/envs/f5-tts/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7037f37b031c in /home/kd/anaconda3/envs/f5-tts/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xeabb4 (0x703841aeabb4 in /home/kd/anaconda3/envs/f5-tts/bin/../lib/libstdc++.so.6)
frame #5: + 0x9ca94 (0x70384329ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x129c3c (0x703843329c3c in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7038404a6897 in /home/kd/anaconda3/envs/f5-tts/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: + 0xe32e33 (0x7037f3432e33 in /home/kd/anaconda3/envs/f5-tts/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xeabb4 (0x703841aeabb4 in /home/kd/anaconda3/envs/f5-tts/bin/../lib/libstdc++.so.6)
frame #3: + 0x9ca94 (0x70384329ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: + 0x129c3c (0x703843329c3c in /lib/x86_64-linux-gnu/libc.so.6)

[rank2]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 2] Timeout at NCCL work: 61, last enqueued NCCL work: 61, last completed NCCL work: 60.
[rank2]:[E ProcessGroupNCCL.cpp:577] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E ProcessGroupNCCL.cpp:583] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=61, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600084 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x787b50dc0897 in /home/kd/anaconda3/envs/f5-tts/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x787b041aa1b2 in /home/kd/anaconda3/envs/f5-tts/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x787b041aefd0 in /home/kd/anaconda3/envs/f5-tts/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x787b041b031c in /home/kd/anaconda3/envs/f5-tts/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xeabb4 (0x787b522eabb4 in /home/kd/anaconda3/envs/f5-tts/bin/../lib/libstdc++.so.6)
frame #5: + 0x9ca94 (0x787b53c9ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x129c3c (0x787b53d29c3c in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 0 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=61, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600084 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x787b50dc0897 in /home/kd/anaconda3/envs/f5-tts/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x787b041aa1b2 in /home/kd/anaconda3/envs/f5-tts/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x787b041aefd0 in /home/kd/anaconda3/envs/f5-tts/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x787b041b031c in /home/kd/anaconda3/envs/f5-tts/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xeabb4 (0x787b522eabb4 in /home/kd/anaconda3/envs/f5-tts/bin/../lib/libstdc++.so.6)
frame #5: + 0x9ca94 (0x787b53c9ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x129c3c (0x787b53d29c3c in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x787b50dc0897 in /home/kd/anaconda3/envs/f5-tts/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: + 0xe32e33 (0x787b03e32e33 in /home/kd/anaconda3/envs/f5-tts/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xeabb4 (0x787b522eabb4 in /home/kd/anaconda3/envs/f5-tts/bin/../lib/libstdc++.so.6)
frame #3: + 0x9ca94 (0x787b53c9ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: + 0x129c3c (0x787b53d29c3c in /lib/x86_64-linux-gnu/libc.so.6)

[rank3]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 3] Timeout at NCCL work: 61, last enqueued NCCL work: 61, last completed NCCL work: 60.
[rank3]:[E ProcessGroupNCCL.cpp:577] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E ProcessGroupNCCL.cpp:583] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=61, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600085 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7895c54f4897 in /home/kd/anaconda3/envs/f5-tts/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7895787aa1b2 in /home/kd/anaconda3/envs/f5-tts/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7895787aefd0 in /home/kd/anaconda3/envs/f5-tts/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7895787b031c in /home/kd/anaconda3/envs/f5-tts/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xeabb4 (0x7895c6aeabb4 in /home/kd/anaconda3/envs/f5-tts/bin/../lib/libstdc++.so.6)
frame #5: + 0x9ca94 (0x7895c829ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x129c3c (0x7895c8329c3c in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 0 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=61, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600085 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7895c54f4897 in /home/kd/anaconda3/envs/f5-tts/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7895787aa1b2 in /home/kd/anaconda3/envs/f5-tts/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7895787aefd0 in /home/kd/anaconda3/envs/f5-tts/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7895787b031c in /home/kd/anaconda3/envs/f5-tts/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xeabb4 (0x7895c6aeabb4 in /home/kd/anaconda3/envs/f5-tts/bin/../lib/libstdc++.so.6)
frame #5: + 0x9ca94 (0x7895c829ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x129c3c (0x7895c8329c3c in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7895c54f4897 in /home/kd/anaconda3/envs/f5-tts/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: + 0xe32e33 (0x789578432e33 in /home/kd/anaconda3/envs/f5-tts/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xeabb4 (0x7895c6aeabb4 in /home/kd/anaconda3/envs/f5-tts/bin/../lib/libstdc++.so.6)
frame #3: + 0x9ca94 (0x7895c829ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: + 0x129c3c (0x7895c8329c3c in /lib/x86_64-linux-gnu/libc.so.6)

W0125 00:51:52.270000 129926476711744 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 28772 closing signal SIGTERM
W0125 00:51:52.271000 129926476711744 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 28774 closing signal SIGTERM
W0125 00:51:52.271000 129926476711744 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 28775 closing signal SIGTERM
E0125 00:51:54.603000 129926476711744 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 1 (pid: 28773) of binary: /home/kd/anaconda3/envs/f5-tts/bin/python
Traceback (most recent call last):
File "/home/kd/anaconda3/envs/f5-tts/bin/accelerate", line 8, in
sys.exit(main())
File "/home/kd/anaconda3/envs/f5-tts/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/home/kd/anaconda3/envs/f5-tts/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1163, in launch_command
multi_gpu_launcher(args)
File "/home/kd/anaconda3/envs/f5-tts/lib/python3.10/site-packages/accelerate/commands/launch.py", line 792, in multi_gpu_launcher
distrib_run.run(args)
File "/home/kd/anaconda3/envs/f5-tts/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/home/kd/anaconda3/envs/f5-tts/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/kd/anaconda3/envs/f5-tts/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/wdc/proj/F5-TTS/src/f5_tts/train/finetune_cli.py FAILED

Let me know if you need any additional information to help troubleshoot this issue.
I think there is some problem in the accelerate configuration.

@sarpba

sarpba commented Jan 24, 2025

Same problem here too. Finetune hangs on step one (2x RTX 3090 + NVLink). GPU power usage sits at 150/370 W, so it may be stuck in a loop forever.

@campar

campar commented Jan 27, 2025

Is there any solution for finetuning on multiple GPUs?

@sarpba

sarpba commented Jan 27, 2025

@campar

I just rolled back the codebase to December 2024,

configured accelerate (I have only 2 GPUs):

Accelerate config:
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
enable_cpu_affinity: false
gpu_ids: 0,1
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

and started it simply with "python ..." rather than with accelerate launch.

I don't understand how, but it's working for me.

@kdcyberdude

Hi @sarpba,

  1. Which GPUs do you have?
  2. How did you pass the accelerate config while running the training directly with python? (Please share the full command if possible.)
  3. Can you mention the December commit that you used to train on multi-GPU?
  4. What kind of error are you getting with the latest commit?

@sarpba

sarpba commented Jan 27, 2025

@kdcyberdude

1. 2x RTX 3090
2. I don't know; I think it's a bug in the code, but this is the only way I can use multi-GPU finetune.

accelerate config:

.../.cache/huggingface/accelerate/default_config.yaml

Accelerate config:
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
enable_cpu_affinity: false
gpu_ids: 0,1
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
python src/f5_tts/train/finetune_gradio.py

3. I have an old clone of this repo on my old SSD and I use that now. Not December, it's from earlier: November 26. I used it in December, sorry for the mistake. If needed, I'll upload the old codebase somewhere.
4. The fresh codebase just hangs on step one forever. Just a guess: maybe the data loader isn't using a DistributedSampler, so all GPUs are trying to process the same data, which can lead to collisions or endless waits (see the sketch below).
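
For reference, a minimal sketch of what a DistributedSampler-based loader looks like in plain PyTorch. This is illustrative only and is not the F5-TTS trainer code (which builds its own dynamic batch sampler, per the "Creating dynamic batches" lines in the log above); the dataset class here is a made-up stand-in:

import torch
from torch.utils.data import DataLoader, Dataset, DistributedSampler

class ToyAudioDataset(Dataset):  # hypothetical placeholder dataset
    def __len__(self):
        return 709
    def __getitem__(self, idx):
        return torch.randn(100, 80)  # fake mel frames

def build_loader(dataset, rank, world_size, batch_size=4):
    # Each rank gets a disjoint shard of the data; without sharding,
    # every GPU iterates the full dataset and ranks can fall out of step.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank, shuffle=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler, drop_last=True)

# In the training loop, call sampler.set_epoch(epoch) once per epoch so the
# shuffle order changes across epochs but stays consistent across ranks.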

The old code is also buggy, but it works in an interesting way, and I get a usable model at the end of the fine-tuning.

I use Ubuntu 22.04 and Anaconda.
pip_list.txt
cuda.txt

Edit: important info, it only started working after I installed NVLink... (without NVLink it is terribly slow, slower than a single GPU).

@kdcyberdude

@sarpba
Generally, multi-GPU training is launched with accelerate, deepspeed, torchrun, etc. Running the script directly with python uses a single GPU only.
I don't think it can fetch the accelerate config internally by default; in my case it started training on a single GPU only.
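
For completeness, a typical explicit multi-GPU launch looks roughly like one of these (a sketch; paths are placeholders and the trailing flags are the usual finetune_cli.py arguments):

accelerate launch --num_processes 2 --config_file ~/.cache/huggingface/accelerate/default_config.yaml \
  src/f5_tts/train/finetune_cli.py --exp_name F5TTS_Base --finetune ...
# or equivalently with torchrun
torchrun --nproc_per_node=2 src/f5_tts/train/finetune_cli.py --exp_name F5TTS_Base --finetune ...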

@wangaocheng, can you update the title of the issue to something more appropriate?

@sarpba

sarpba commented Jan 27, 2025

@kdcyberdude I know that multi-GPU training normally works. I have trained LLM and Whisper models a lot over the last 1-1.5 years, but my best practices don't work here for some reason. I don't understand why, but simply starting with Python uses both GPUs on the 11/26/2024 code. All other attempts freeze :/

It wouldn't be a problem if someone who knows the multi-GPU magic could throw in an accelerate config that makes it work.

@kdcyberdude

@sarpba Which one did you use: train.py, finetune_cli.py, or finetune_gradio.py?

@ZhikangNiu @SWivid Do you guys have any idea what the reason could be?

@hcsolakoglu
Contributor

I encountered this bug too. I hadn't noticed it before since I was only training with a single A100, but when I rented 2 RTX 3090s and started training through finetune CLI with accelerate, it shows no progress after step 1, even though GPU utilization is 90%+ for both cards. I think this is an issue with accelerate not reporting the steps correctly, but I haven't done detailed debugging since I continued training with a single GPU. I'll follow this issue and open a PR if I find a solution.

@SWivid
Owner

SWivid commented Jan 28, 2025

If the issue is caused by finetune-gradio but not finetune-cli, feel free to check whether the latest version works
tia~

@sarpba

sarpba commented Jan 28, 2025

@SWivid I have the same problem with finetune_cli.py & train.py too. I'm training a small model from scratch with only 1 GPU using train.py (latest version) :/ It will never finish... :D

I tried it with multi-GPU; it hangs forever at the first step.

@SWivid
Owner

SWivid commented Jan 28, 2025

I tried it with multi-GPU; it hangs forever at the first step.

Hi @sarpba, could you provide the message printed when you press Ctrl-C after the hang?

@sarpba

sarpba commented Jan 28, 2025

@SWivid Hello, sure:

^CW0128 10:15:38.077000 126071620457536 torch/distributed/elastic/agent/server/api.py:741] Received Signals.SIGINT death signal, shutting down workers
W0128 10:15:38.077000 126071620457536 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 56496 closing signal SIGINT
W0128 10:15:38.077000 126071620457536 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 56497 closing signal SIGINT
^CW0128 10:15:38.277000 126071620457536 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 56496 closing signal SIGTERM
W0128 10:15:38.277000 126071620457536 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 56497 closing signal SIGTERM
^CTraceback (most recent call last):
  File "/home/sarpba/anaconda3/envs/f5-tts/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 733, in run
    result = self._invoke_run(role)
  File "/home/sarpba/anaconda3/envs/f5-tts/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 876, in _invoke_run
    time.sleep(monitor_interval)
  File "/home/sarpba/anaconda3/envs/f5-tts/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 76, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 56470 got signal: 2

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/sarpba/anaconda3/envs/f5-tts/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 742, in run
    self._shutdown(e.sigval)
  File "/home/sarpba/anaconda3/envs/f5-tts/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/local_elastic_agent.py", line 296, in _shutdown
    self._pcontext.close(death_sig)
  File "/home/sarpba/anaconda3/envs/f5-tts/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 541, in close
    self._close(death_sig=death_sig, timeout=timeout)
  File "/home/sarpba/anaconda3/envs/f5-tts/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 861, in _close
    handler.proc.wait(time_to_wait)
  File "/home/sarpba/anaconda3/envs/f5-tts/lib/python3.10/subprocess.py", line 1209, in wait
    return self._wait(timeout=timeout)
  File "/home/sarpba/anaconda3/envs/f5-tts/lib/python3.10/subprocess.py", line 1953, in _wait
    time.sleep(delay)
  File "/home/sarpba/anaconda3/envs/f5-tts/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 76, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 56470 got signal: 2

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/sarpba/anaconda3/envs/f5-tts/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/sarpba/anaconda3/envs/f5-tts/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/sarpba/anaconda3/envs/f5-tts/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1163, in launch_command
    multi_gpu_launcher(args)
  File "/home/sarpba/anaconda3/envs/f5-tts/lib/python3.10/site-packages/accelerate/commands/launch.py", line 792, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/sarpba/anaconda3/envs/f5-tts/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/sarpba/anaconda3/envs/f5-tts/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/sarpba/anaconda3/envs/f5-tts/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 254, in launch_agent
    result = agent.run()
  File "/home/sarpba/anaconda3/envs/f5-tts/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
    result = f(*args, **kwargs)
  File "/home/sarpba/anaconda3/envs/f5-tts/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 747, in run
    self._shutdown()
  File "/home/sarpba/anaconda3/envs/f5-tts/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/local_elastic_agent.py", line 296, in _shutdown
    self._pcontext.close(death_sig)
  File "/home/sarpba/anaconda3/envs/f5-tts/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 541, in close
    self._close(death_sig=death_sig, timeout=timeout)
  File "/home/sarpba/anaconda3/envs/f5-tts/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 861, in _close
    handler.proc.wait(time_to_wait)
  File "/home/sarpba/anaconda3/envs/f5-tts/lib/python3.10/subprocess.py", line 1209, in wait
    return self._wait(timeout=timeout)
  File "/home/sarpba/anaconda3/envs/f5-tts/lib/python3.10/subprocess.py", line 1953, in _wait
    time.sleep(delay)
  File "/home/sarpba/anaconda3/envs/f5-tts/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 76, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 56470 got signal: 2

@P-uck

P-uck commented Jan 28, 2025

@SWivid @ZhikangNiu

Hi ,

So I've been struggling with this problem for a week. The issue is that my distributed training setup hangs during NCCL AllReduce operations. Below is part of the NCCL log:

Epoch 1/1742: 6%|█████ | 1/18 [00:04<01:16, 4.52s/update, loss=0.452, update=1]
0d0ebf8294f3:91788:91788 [0] NCCL INFO Broadcast: opCount 3c sendbuff 0x7767057ffa00 recvbuff 0x7767057ffa00 count 8 datatype 0 op 0 root 0 comm 0x5813c1f1b340 [nranks=2] stream 0x5813da6822f0
0d0ebf8294f3:91788:91788 [0] NCCL INFO 8 Bytes -> Algo 1 proto 2 time 14.100800
0d0ebf8294f3:91788:91788 [0] NCCL INFO Broadcast: opCount 3d sendbuff 0x7763b8000000 recvbuff 0x7763b8000000 count 8388736 datatype 0 op 0 root 0 comm 0x5813c1f1b340 [nranks=2] stream 0x5813da6822f0
0d0ebf8294f3:91788:91788 [0] NCCL INFO 8388736 Bytes -> Algo 1 proto 2 time 852.973572
...
0d0ebf8294f3:91788:94129 [0] NCCL INFO AllReduce: opCount 3e sendbuff 0x77653aa00000 recvbuff 0x77653aa00000 count 2202724 datatype 7 op 0 root 0 comm 0x5813c1f1b340 [nranks=2] stream 0x5813da6822f0
0d0ebf8294f3:91788:94129 [0] NCCL INFO 8810896 Bytes -> Algo 1 proto 2 time 900.889587
...
The log shows repeated large AllReduce operations, which take a long time and eventually cause the process to hang.
The training hangs during the first epoch while performing these NCCL AllReduce operations. Each operation is large (around 29 MB) and takes several seconds to complete, and over time this leads to a deadlock.

Environment:
PyTorch version: 2.0.1
CUDA version: 12.1
NCCL version: 2.21.5
GPUs: 2 x NVIDIA RTX 4090 24GB
(also tested with 2 x A100, 4 x 3090, and 2 x A5000 GPUs ....)

I think the issue might be one of these:

1- The connection or bandwidth between the GPUs isn't enough to handle the large amount of data produced by the code.
2- There's a problem with Distributed Data Parallel (DDP).

I reduced the batch size, but it didn’t help.
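
One generic check (a sketch using Accelerate's standard kwargs handlers, not something present in finetune_cli.py) is to raise the NCCL collective timeout from the default 10 minutes, which at least distinguishes a genuine deadlock from a very slow collective:

from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Raise the process-group timeout (the 600000 ms seen in the watchdog logs) to 2 hours.
ipg_kwargs = InitProcessGroupKwargs(timeout=timedelta(hours=2))
accelerator = Accelerator(kwargs_handlers=[ipg_kwargs])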

If you need more logs, please let me know.

thanks

@sarpba

sarpba commented Jan 28, 2025

Hello @P-uck

1- The connection or bandwidth between the GPUs isn't enough to handle the large amount of data produced by the code.

I have an active NVLink between the 2 cards, so I don't think that's the issue.

nvidia-smi topo -m
	GPU0	GPU1	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	NV4	0-11	0		N/A
GPU1	NV4	 X 	0-11	0		N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

@P-uck

P-uck commented Jan 28, 2025

@sarpba
Hi,

The type of GPU connection, whether via PCIe lanes or NVLink, shouldn't affect whether accelerate works if the code is properly written. NVLink may give better performance, but other connection types shouldn't cause the run to hang.

@hcsolakoglu
Contributor

hcsolakoglu commented Jan 28, 2025

It probably won't solve it, but someone who has multiple GPUs could try this branch of my fork - I fixed the DDP warning, and maybe it helps with the hanging issue... git clone -b fix-ddp-warning https://github.com/hcsolakoglu/F5-TTS.git
@sarpba @P-uck
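
For context, the warning in the logs above comes from DDP being constructed with find_unused_parameters=True. With Accelerate that flag is normally controlled through DistributedDataParallelKwargs, roughly as below; this is a sketch of the mechanism, not necessarily what the branch changes:

from accelerate import Accelerator
from accelerate.utils import DistributedDataParallelKwargs

# If the model truly has no unused parameters in the forward pass, turning this
# off removes the extra autograd-graph traversal DDP performs every iteration.
ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=False)
accelerator = Accelerator(kwargs_handlers=[ddp_kwargs])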

@sarpba

sarpba commented Jan 28, 2025

@hcsolakoglu there is no change

@SWivid
Owner

SWivid commented Jan 29, 2025

@wangaocheng @P-uck @kdcyberdude @sarpba @campar @hcsolakoglu
Hey guys, thanks for reporting the bug and all efforts.

The latest version (0.5.0) should have fixed this bug; sorry for the inconvenience caused!
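
A typical way to pick up the fix, depending on whether F5-TTS was installed from PyPI or from a source checkout (adjust to your setup):

pip install -U f5-tts
# or, inside a git clone:
git pull && pip install -e .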

@sarpba

sarpba commented Jan 29, 2025

Thank you! It's ok now!

@SWivid SWivid closed this as completed Jan 29, 2025
spygaurad pushed a commit to spygaurad/F5-TTS that referenced this issue Feb 28, 2025