Trouble with Arabic in F5-TTS and Sample Outputs #715
Comments
Hi @P-uck
the sample saved during training is not using EMA; you could turn it off when inferencing (or pass the option through infer_gradio/infer_cli; see F5-TTS/src/f5_tts/infer/utils_infer.py, lines 223 to 232, at f992c4e).
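For context, the difference boils down to which state dict you load from the checkpoint. A minimal sketch (key names follow the usual layout of these checkpoints, but verify against your own; the helper name is illustrative):

```python
import torch

def load_weights(model, ckpt_path, device="cuda", use_ema=True):
    # Samples saved during training come from the raw (non-EMA) weights,
    # while inference defaults to EMA; pass use_ema=False to reproduce
    # the training-time samples.
    checkpoint = torch.load(ckpt_path, map_location=device)
    if use_ema and "ema_model_state_dict" in checkpoint:
        # EMA weights are stored under an "ema_model." prefix; strip it
        # and drop the EMA bookkeeping entries.
        state = {
            k.replace("ema_model.", ""): v
            for k, v in checkpoint["ema_model_state_dict"].items()
            if k not in ("initted", "step")
        }
    else:
        state = checkpoint["model_state_dict"]
    model.load_state_dict(state)
    return model.to(device)
```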
Would you be so kind as to share the fine-tuned Arabic model?
Hi @SWivid, thank you for your response. I now have a clean dataset with 15 hours of data, but I've noticed my 4090 GPU is running quite slowly. I'm wondering how I can set up multiple GPUs on an online platform like Vast.ai. I tried using Accelerate, but it didn't work because all the GPUs are on a single workstation. @ftk789, I absolutely will, but I need to make some progress first, because there are currently many issues with handling Arabic diacritics (Tashkil).
That's so kind of you @P-uck, thank you in advance! It'd be nice to see what kind of dataset you are training on and how good it will sound.
@SWivid Run command output:

```
The following values were not passed to
Loading model cost 0.535 seconds.
Loading model cost 0.535 seconds.
vocab : 2587
vocoder : vocos
vocab : 2587
vocoder : vocos
Sorting with sampler... if slow, check whether dataset is provided with duration: 0%| | 0/3764 [00:00<?, ?it/s]
Creating dynamic batches with 6000 audio frames per gpu: 0%| | 0/3764 [00:00<?, ?it/s]
Sorting with sampler... if slow, check whether dataset is provided with duration: 0%| | 0/3764 [00:00<?, ?it/s]
Creating dynamic batches with 6000 audio frames per gpu: 0%| | 0/3764 [00:00<?, ?it/s]
Epoch 1/1742: 0%| | 0/162 [00:00<?, ?update/s]
[rank1]:[W reducer.cpp:1389] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
Epoch 1/1742: 1%|▋ | 1/162 [00:01<04:50, 1.80s/update]
```
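As an aside, the find_unused_parameters warning in this log is only about performance. If you are adapting the trainer yourself, one way to silence it with Accelerate is shown below (a sketch, not necessarily the repository's exact setup):

```python
from accelerate import Accelerator
from accelerate.utils import DistributedDataParallelKwargs

# Skip the extra autograd-graph traversal that triggers the warning.
# Only safe if every parameter really is used in each forward pass.
ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=False)
accelerator = Accelerator(kwargs_handlers=[ddp_kwargs])
```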
Hi @P-uck, feel free to check if the latest pull works~
Hi @SWivid, thanks for the update. I just tested it with 2×4090 and 4×3090, and it's working perfectly. After around 2500 epochs on an ~8-hour clean single-speaker Arabic dataset, the performance is good, but I have two issues:

1. At the beginning of the generated speech, I can hear a faint voice that sounds like the end of the reference text and audio. Why does this happen, and how can I remove it?
2. When generating longer sentences, the quality drops significantly after ~25-30 seconds, with missing words and overall degradation. Do you have any suggestions to improve this?
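For the second issue, a common workaround is to split long generation text into sentence-sized chunks, synthesize them one at a time, and concatenate the audio. A minimal sketch, where the character threshold and the splitting rule are assumptions rather than the repository's actual logic:

```python
import re

def chunk_text(text, max_chars=200):
    # Split on sentence-ending punctuation (including the Arabic
    # question mark), then greedily pack sentences into chunks.
    sentences = re.split(r"(?<=[.!?؟])\s+", text)
    chunks, current = [], ""
    for sentence in sentences:
        if len(current) + len(sentence) + 1 <= max_chars:
            current = f"{current} {sentence}".strip()
        else:
            if current:
                chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```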
Hi @P-uck, it's OK: during actual inference, a preprocess of the reference audio adds a proper silence clip to the tail, which avoids the leak of the prompt to a certain extent.
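The idea is roughly the following (a sketch using torchaudio; the silence duration is an assumption, not the value the repo actually uses):

```python
import torch
import torchaudio

def pad_ref_tail(in_path, out_path, silence_ms=300):
    # Append a short silence to the reference clip so the prompt audio
    # does not bleed into the start of the generated speech.
    wav, sr = torchaudio.load(in_path)
    silence = torch.zeros(wav.shape[0], int(sr * silence_ms / 1000))
    torchaudio.save(out_path, torch.cat([wav, silence], dim=1), sr)
```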
Thanks, it's clear to me now.
I assume that you used the base model for training, so your finished model should be able to speak English and Arabic, right? Would it be possible to use both languages in one sentence? Would it recognize that it is now a different language and then pronounce the word in Arabic? If not, how could that be done? My goal is actually to change the pronunciation of some English words that are of Arabic origin.
Hey man, it seems you finished the Arabic model. Can you share it with us, please?
Question details
Hi everyone,
I've been working with F5-TTS for the past week to fine-tune it on Arabic. My dataset is around 5 hours of very clean, diverse, single-speaker audio. My GPU is an RTX 4090, and I trained for around 1000 epochs.
The problem is that during training, after about 50% of the epochs have passed, the sample outputs in ckpts/Arabic_new/samples/ start sounding good, almost identical to the base data. But when training is done and I try generating audio with the final checkpoints, the output becomes pretty much unintelligible. The tone and style are fine, but you can't understand the words at all. I've tried both the Gradio interface and the CLI tools with the same checkpoints, but no luck; they both give me the same garbled output.
I checked the code, specifically how it generates these sample outputs during training. From what I see, it uses intermediate checkpoints to generate mel spectrograms for the given gen_text and then passes those through the vocoder.
So, if the samples during training are generated like this, why can’t I recreate the same quality when I use the final checkpoints? Am I missing something obvious here?
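For reference, the sample-generation path described above amounts to something like this sketch (the model.sample call is a hypothetical placeholder for the checkpoint's mel generation; the Vocos calls are the real vocoder API):

```python
import torch
from vocos import Vocos

def render_sample(model, gen_text, ref_mel):
    # The (intermediate) checkpoint produces a mel spectrogram for
    # gen_text, which the Vocos vocoder decodes to a 24 kHz waveform.
    vocoder = Vocos.from_pretrained("charactr/vocos-mel-24khz")
    with torch.inference_mode():
        mel = model.sample(gen_text, ref_mel)  # hypothetical model API
        return vocoder.decode(mel)
```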
Also, can F5-TTS learn Arabic, with its complex phonemes and vocabulary, from only 5 hours of data? Has anyone tried something similar and had success? Maybe there are some specific settings I should tweak?
Thanks