Fine-tune error for this version #728
Comments
So what is the question and bug?
Hi, I think the problem is that on multiple GPUs the finetune process freezes. I have the same problem in #715 as well.
Hi @ZhikangNiu @SWivid @wangaocheng, I've also tried disabling P2P without success:

NCCL_IB_DISABLE=1 NCCL_P2P_DISABLE=1 accelerate launch --config_file ./accelerate_config_4-4090.yaml /wdc/proj/F5-TTS/src/f5_tts/train/finetune_cli.py \
--exp_name F5TTS_Base \
--learning_rate 7.5e-5 \
--batch_size_per_gpu 9600 \
--batch_size_type frame \
--max_samples 64 \
--grad_accumulation_steps 8 \
--max_grad_norm 1 \
--epochs 1865 \
--num_warmup_updates 20000 \
--save_per_updates 15000 \
--last_per_updates 30000 \
--dataset_name raag_tasariv \
--finetune \
--pretrain /wdc/proj/F5-TTS/ckpts/raag_tasariv_custom/model_1250000.pt \
--tokenizer_path /wdc/proj/F5-TTS/data/raag_tasariv_custom/vocab.txt \
--tokenizer custom \
--log_samples
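As a side note not taken from the thread: a minimal NCCL sanity check run outside of F5-TTS can tell whether collectives hang on this machine at all. This is a hypothetical helper script (assuming `torchrun` is available), not part of the project:

```python
# Minimal NCCL sanity check, independent of F5-TTS (hypothetical helper script).
# Run with: torchrun --nproc_per_node=4 nccl_check.py
import torch
import torch.distributed as dist


def main():
    # torchrun sets RANK/WORLD_SIZE/MASTER_ADDR for the default env:// init
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    x = torch.ones(1, device="cuda")
    dist.all_reduce(x)  # hangs here if inter-GPU communication is broken
    print(f"rank {rank}: all_reduce ok, value={x.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

If this tiny all_reduce also stalls, the problem is in the NCCL/driver/topology setup rather than in the training code.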
When using accelerate launch for other projects (e.g., …)

Error Logs:

Epoch 1/1865: 0%| | 1/1052 [00:02<41:01, 2.34s/update, loss=8.04, update=1]
[rank1]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=61, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600081 milliseconds before timing out.
[rank3]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=61, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600085 milliseconds before timing out.
[rank2]:[E ProcessGroupNCCL.cpp:563] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=61, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600084 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:563] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63, OpType=ALLREDUCE, NumelIn=2202724, NumelOut=2202724, Timeout(ms)=600000) ran for 600000 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 1] Timeout at NCCL work: 61, last enqueued NCCL work: 61, last completed NCCL work: 60.
[rank1]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=61, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600081 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7038404a6897 in /home/kd/anaconda3/envs/f5-tts/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7037f37aa1b2 in /home/kd/anaconda3/envs/f5-tts/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7037f37aefd0 in /home/kd/anaconda3/envs/f5-tts/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7037f37b031c in /home/kd/anaconda3/envs/f5-tts/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xeabb4 (0x703841aeabb4 in /home/kd/anaconda3/envs/f5-tts/bin/../lib/libstdc++.so.6)
frame #5: + 0x9ca94 (0x70384329ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x129c3c (0x703843329c3c in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
[rank2]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 2] Timeout at NCCL work: 61, last enqueued NCCL work: 61, last completed NCCL work: 60.
terminate called after throwing an instance of 'c10::DistBackendError'
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
[rank3]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 3] Timeout at NCCL work: 61, last enqueued NCCL work: 61, last completed NCCL work: 60.
terminate called after throwing an instance of 'c10::DistBackendError'
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
W0125 00:51:52.270000 129926476711744 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 28772 closing signal SIGTERM
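Each rank here hits the default 600000 ms (10 minute) NCCL watchdog timeout on an AllReduce. Purely as a sketch of my own (not a fix from the maintainers), the timeout can be raised wherever the `Accelerator` is constructed in the trainer, which helps rule out "one rank is just slow" versus a real deadlock; the two-hour value is an arbitrary assumption:

```python
# Sketch: raise the NCCL collective timeout above the 10-minute default so the
# watchdog does not kill slow ranks. This does NOT fix a genuine deadlock.
from datetime import timedelta

from accelerate import Accelerator, InitProcessGroupKwargs

accelerator = Accelerator(
    kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(hours=2))]
)
```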
Same problem here too. Finetune hangs on step one (2x RTX 3090 + NVLink). GPU power usage sits at 150/370 W. Maybe it's stuck in a loop forever.
Is there any solution for finetuning on multiple GPUs?
I just rolled the codebase back to December 2024, configured accelerate (I have only 2 GPUs), and started it simply with "python ..." instead of accelerate launch. I don't understand how, but it's working for me.
Hi @sarpba,
1. 2x 3090
2. .../.cache/huggingface/accelerate/default_config.yaml
3. I have an old clone of this repo on my old SSD; I use that now. Not December, it's from before: November 26. I used it in December, sorry for the mistake... If needed, I'll upload the old codebase somewhere. The old code is also buggy, but it works in an interesting way, and I get a usable model at the end of the fine-tuning. I use Ubuntu 22.04 and Anaconda.
Edit: important info, it only started working after I installed NVLink... (without NVLink it's terribly slow, slower than one GPU)
@sarpba @wangaocheng Can you update the title of the issue to something more appropriate?
@kdcyberdude I know that multi-GPU training normally works. I have trained LLM and Whisper models a lot over the last 1-1.5 years, but my best practices don't work here for some reason. I don't understand why, but simply starting with python uses both GPUs on the 11/26/2024 daily code. All other attempts freeze :/ It wouldn't be a problem if someone who can do the multi-GPU magic could throw in an accelerate config that makes it work.
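For what it's worth (my addition, not something posted in the thread), a clean default accelerate config can also be regenerated programmatically instead of hand-editing the YAML; the `mixed_precision` value below is an assumption:

```python
# Sketch: regenerate a basic accelerate config; multi-GPU is picked up from the
# visible devices. By default the config file lands under
# ~/.cache/huggingface/accelerate/ (the same location referenced earlier).
from accelerate.utils import write_basic_config

write_basic_config(mixed_precision="no")
```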
@sarpba Which one did you use? @ZhikangNiu @SWivid Do you guys have any idea what could be the reason?
I encountered this bug too. I hadn't noticed it before since I was only training with a single A100, but when I rented 2 RTX 3090s and started training through finetune CLI with accelerate, it shows no progress after step 1, even though GPU utilization is 90%+ for both cards. I think this is an issue with accelerate not reporting the steps correctly, but I haven't done detailed debugging since I continued training with a single GPU. I'll follow this issue and open a PR if I find a solution.
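One low-effort way (not from the thread) to see where each rank is actually stuck during such a hang is to have Python dump every thread's stack after a delay; the timeout value and the idea of placing this near the top of `finetune_cli.py` are assumptions:

```python
# Hypothetical debugging aid: if training stalls, print all thread stack traces
# to stderr every 10 minutes so the blocking call (e.g. an all_reduce) is visible.
import faulthandler
import sys

faulthandler.dump_traceback_later(timeout=600, repeat=True, file=sys.stderr)
```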
If the issue is caused by finetune-gradio but not finetune-cli, feel free to check whether the latest version works.
@SWivid I have the same problem with finetune_cli.py and train.py too. I'm training a small model from scratch with only 1 GPU using train.py (latest version) :/ It'll never finish... :D I tried it with multi-GPU and it hangs forever at the first step.
Hi @sarpba, could you provide the message printed when you press Ctrl-C after the hang?
@SWivid Hello, sure:
Hi, so I've been struggling with this problem for a week. The issue is that my distributed training setup hangs during NCCL AllReduce operations. Below is part of the NCCL log:
Environment: …
I think the issue might be one of these:
1. The connection or bandwidth between the GPUs isn't enough to handle the large amount of data produced by the code. I reduced the batch size, but it didn't help.
If you need more logs, please let me know. Thanks.
Hello @P-uck, I have an active NVLink between the 2 cards, so I don't think that's the problem.
@sarpba The type of GPU connection, whether via PCI lanes or NVLink, shouldn't affect the functionality of accelerate if the code is properly written. While NVLink may provide better performance, it shouldn't cause the system to hang with other types of connections. |
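A quick way (again, my sketch rather than something from the thread) to confirm what PyTorch actually sees for peer-to-peer access between the cards:

```python
# Print whether CUDA peer-to-peer access is available between every GPU pair;
# NCCL falls back to slower transports when P2P is not available.
import torch

n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: peer access {'yes' if ok else 'NO'}")
```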
@hcsolakoglu there is no change.
@wangaocheng @P-uck @kdcyberdude @sarpba @campar @hcsolakoglu The latest version (0.5.0) should fix this bug; sorry for the inconvenience caused!
Thank you! It's ok now! |
Checks
Environment Details
Docker
Steps to Reproduce
Startup successful, but unable to train.
(f5-tts) root@abd144ff4f98:/github/F5-TTS# f5-tts_finetune-gradio --host 0.0.0.0
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.402 seconds.
Prefix dict has been built successfully.
Word segmentation module jieba initialized.
Starting app...
To create a public link, set `share=True` in `launch()`.
run command:
accelerate launch /github/F5-TTS/src/f5_tts/train/finetune_cli.py --exp_name F5TTS_Base --learning_rate 1e-05 --batch_size_per_gpu 1600 --batch_size_type frame --max_samples 64 --grad_accumulation_steps 1 --max_grad_norm 1 --epochs 5000 --num_warmup_updates 72 --save_per_updates 5000 --keep_last_n_checkpoints 5 --last_per_updates 18 --dataset_name wac --finetune --tokenizer pinyin --log_samples --logger wandb
The following values were not passed to `accelerate launch` and had defaults used instead:
  `--num_processes` was set to a value of `2`
    More than one GPU was found, enabling multi-GPU training.
    If this was unintended please pass in `--num_processes=1`.
  `--num_machines` was set to a value of `1`
  `--mixed_precision` was set to a value of `'no'`
  `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Word segmentation module jieba initialized.
Word segmentation module jieba initialized.
Loading model cost 0.415 seconds.
Prefix dict has been built successfully.
Loading model cost 0.417 seconds.
Prefix dict has been built successfully.
vocab : 3651
vocoder : vocos
vocab : 3651
vocoder : vocos
wandb: Currently logged in as: aochengwang. Use `wandb login --relogin` to force relogin
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: - Waiting for wandb.init()...
wandb: \ Waiting for wandb.init()...
wandb: | Waiting for wandb.init()...
wandb: Tracking run with wandb version 0.19.2
wandb: Run data is saved locally in /github/F5-TTS/wandb/run-20250117_154459-s03chtpd
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run F5TTS_Base
wandb: ⭐️ View project at https://wandb.ai/aochengwang/wac
wandb: 🚀 View run at https://wandb.ai/aochengwang/wac/runs/s03chtpd
Using logger: wandb
Loading dataset ...
Loading dataset ...
Download Vocos from huggingface charactr/vocos-mel-24khz
Download Vocos from huggingface charactr/vocos-mel-24khz
Sorting with sampler... if slow, check whether dataset is provided with duration: 0%| | 0/709 [00:00<?, ?it/s]
Sorting with sampler... if slow, check whether dataset is provided with duration: 100%|██████████| 709/709 [00:00<00:00, 1855122.61it/s]
Creating dynamic batches with 1600 audio frames per gpu: 0%| | 0/709 [00:00<?, ?it/s]
Creating dynamic batches with 1600 audio frames per gpu: 100%|██████████| 709/709 [00:00<00:00, 2779216.39it/s]
Sorting with sampler... if slow, check whether dataset is provided with duration: 0%| | 0/709 [00:00<?, ?it/s]
Sorting with sampler... if slow, check whether dataset is provided with duration: 100%|██████████| 709/709 [00:00<00:00, 1870290.27it/s]
Creating dynamic batches with 1600 audio frames per gpu: 0%| | 0/709 [00:00<?, ?it/s]
Creating dynamic batches with 1600 audio frames per gpu: 100%|██████████| 709/709 [00:00<00:00, 2650411.35it/s]
Epoch 1/5000: 0%| | 0/76 [00:00<?, ?update/s][rank0]:[W reducer.cpp:1389] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[rank1]:[W reducer.cpp:1389] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
Epoch 1/5000: 1%|▏ | 1/76 [00:02<02:55, 2.34s/update]
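Regarding the `find_unused_parameters=True` warning in the log above: if the model really has no unused parameters in the forward pass, the flag can be disabled where the `Accelerator` is created in the trainer. This is a sketch under that assumption, not a confirmed F5-TTS change:

```python
# Sketch: skip DDP's extra autograd-graph traversal, as the warning suggests.
# Only safe if every parameter receives a gradient on every step.
from accelerate import Accelerator, DistributedDataParallelKwargs

accelerator = Accelerator(
    kwargs_handlers=[DistributedDataParallelKwargs(find_unused_parameters=False)]
)
```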
✔️ Expected Behavior
No response
❌ Actual Behavior
No response