Trouble with Arabic in F5-TTS and Sample Outputs #715

Closed
4 tasks done
P-uck opened this issue Jan 13, 2025 · 13 comments
Labels
question Further information is requested

Comments

P-uck commented Jan 13, 2025

Checks

  • This template is only for questions, not feature requests or bug reports.
  • I have thoroughly reviewed the project documentation and read the related paper(s).
  • I have searched existing issues, including closed ones, and found no similar questions.
  • I confirm that I am using English to submit this report in order to facilitate communication.

Question details

Hi everyone,

I've been working with F5-TTS for the past week to fine-tune it on Arabic. My dataset is around 5 hours of very clean, diverse, single-speaker audio. My GPU is an RTX 4090, and I trained for around 1000 epochs.

The problem is that during training, after about 50% of the epochs, the sample outputs in ckpts/Arabic_new/samples/ start sounding good, almost identical to the base data. But when training is done and I try generating audio with the final checkpoints, the output is pretty much unintelligible. The tone and style are fine, but you can't understand the words at all. I've tried both the Gradio interface and the CLI tools with the same checkpoints, but no luck; they both give me the same garbled output.

I checked the code, especially how it generates these sample outputs during training. From what I see, it uses intermediate checkpoints to generate mel-spectrograms for the given gen_text and then passes those through the vocoder.
So if the samples during training are generated like this, why can't I recreate the same quality with the final checkpoints? Am I missing something obvious here?
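
For illustration, a minimal sketch of that mel-to-waveform step with the Vocos vocoder F5-TTS downloads (charactr/vocos-mel-24khz); the mel tensor below is a random stand-in rather than real model output:

```python
import torch
from vocos import Vocos

# The same pretrained vocoder F5-TTS pulls from Hugging Face.
vocos = Vocos.from_pretrained("charactr/vocos-mel-24khz")

# Stand-in mel-spectrogram: (batch, n_mels, frames); n_mels=100 for this model.
mel = torch.randn(1, 100, 256)

with torch.no_grad():
    waveform = vocos.decode(mel)  # (batch, samples) at 24 kHz
```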

Also, can F5-TTS learn Arabic, with its complex phonemes and vocabulary, from only 5 hours of data? Has anyone tried something similar and had success? Maybe there are some specific settings I should tweak?

Thanks

P-uck added the "question" label on Jan 13, 2025
P-uck changed the title from "New languages Train, Arabic, serious problems" to "Trouble with Arabic in F5-TTS and Sample Outputs" on Jan 13, 2025
SWivid (Owner) commented Jan 14, 2025

Hi @P-uck

Try turning off use_ema if you are using an early-stage fine-tuned checkpoint (one that has gone through just a few updates).

> uses intermediate checkpoints to generate mel-spectrograms for the given gen_text

The samples saved during training are not generated with the EMA weights.

You could turn it off at inference time (or pass use_ema=False in infer_gradio/infer_cli):

```python
def load_model(
    model_cls,
    model_cfg,
    ckpt_path,
    mel_spec_type=mel_spec_type,
    vocab_file="",
    ode_method=ode_method,
    use_ema=True,
    device=device,
):
```
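
For example, a sketch of loading a fine-tuned checkpoint without the EMA weights; the checkpoint and vocab paths are hypothetical, and the DiT config is an assumption matching the F5TTS_Base architecture at the time of writing, so adapt both to your setup:

```python
from f5_tts.infer.utils_infer import load_model
from f5_tts.model import DiT

# Assumed F5TTS_Base architecture config; verify against your repo version.
model_cfg = dict(dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4)

model = load_model(
    DiT,
    model_cfg,
    "ckpts/Arabic_new/model_last.pt",            # hypothetical checkpoint path
    vocab_file="data/ahmad4090_char/vocab.txt",  # hypothetical vocab path
    use_ema=False,  # match the non-EMA weights used for training samples
)
```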

ftk789 commented Jan 15, 2025

Would you be kind enough to share the fine-tuned Arabic model?

P-uck (Author) commented Jan 21, 2025

Hi @SWivid,

Thank you for your response.

Now I have a clean dataset with 15 hours of data, but I've noticed my 4090 is running quite slowly. I'm wondering how I can set up multiple GPUs on an online platform like Vast.ai. I tried using Accelerate, but it didn't work because all the GPUs are on a single workstation.
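
For reference, a minimal single-machine, multi-GPU launch sketch with Accelerate (the script path mirrors the one used later in this thread; the other training flags are omitted here):

```bash
# Either run `accelerate config` once, or pass the topology explicitly:
accelerate launch --multi_gpu --num_processes 2 \
    /workspace/F5-TTS/src/f5_tts/train/finetune_cli.py \
    --exp_name F5TTS_Base --dataset_name ahmad4090
```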

@ftk789, I absolutely will, but I need to make some progress first because there are currently many issues with handling Arabic diacritics (Tashkil).

ftk789 commented Jan 22, 2025

> @ftk789, I absolutely will, but I need to make some progress first because there are currently many issues with handling Arabic diacritics (Tashkil).

That's so kind of you, @P-uck. It'd be nice to see what kind of dataset you are training on and how good it will sound. Thank you in advance!

P-uck (Author) commented Jan 22, 2025

@SWivid
Update: I tried to run on 2× 4090 GPUs, but it kind of freezes. I also tried a conf.yaml for Hydra, but no luck.

Run command:

```bash
accelerate launch --mixed_precision=fp16 /workspace/F5-TTS/src/f5_tts/train/finetune_cli.py --exp_name F5TTS_Base --learning_rate 7.5e-05 --batch_size_per_gpu 6000 --batch_size_type frame --max_samples 64 --grad_accumulation_steps 1 --max_grad_norm 1 --epochs 1742 --num_warmup_updates 188 --save_per_updates 376 --keep_last_n_checkpoints -1 --last_per_updates 94 --dataset_name ahmad4090 --tokenizer char --log_samples --logger wandb --bnb_optimizer
```

Output:

```
The following values were not passed to accelerate launch and had defaults used instead:
--num_processes was set to a value of 2
More than one GPU was found, enabling multi-GPU training.
If this was unintended please pass in --num_processes=1.
--num_machines was set to a value of 1
--dynamo_backend was set to a value of 'no'
To avoid this warning pass in values for each of the problematic parameters or run accelerate config.
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Word segmentation module jieba initialized.
Loading model cost 0.535 seconds.
Prefix dict has been built successfully.
Word segmentation module jieba initialized.
Loading model cost 0.535 seconds.
Prefix dict has been built successfully.

vocab : 2587
vocoder : vocos
vocab : 2587
vocoder : vocos

Using logger: None
Loading dataset ...
Loading dataset ...
/opt/conda/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/opt/conda/lib/python3.10/site-packages/torchvision/image.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev' If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
  warn(
/opt/conda/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/opt/conda/lib/python3.10/site-packages/torchvision/image.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev' If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
  warn(
Download Vocos from huggingface charactr/vocos-mel-24khz
Download Vocos from huggingface charactr/vocos-mel-24khz

Sorting with sampler... if slow, check whether dataset is provided with duration: 100%|█| 3764/3764 [00:00<00:00, 1664806.52it/s]
Creating dynamic batches with 6000 audio frames per gpu: 100%|█| 3764/3764 [00:00<00:00, 2439709.51it/s]
Sorting with sampler... if slow, check whether dataset is provided with duration: 100%|█| 3764/3764 [00:00<00:00, 1648121.96it/s]
Creating dynamic batches with 6000 audio frames per gpu: 100%|█| 3764/3764 [00:00<00:00, 2840475.04it/s]

Epoch 1/1742: 0%| | 0/162 [00:00<?, ?update/s][rank1]:[W reducer.cpp:1389] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[rank0]:[W reducer.cpp:1389] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())

Epoch 1/1742: 1%|▋ | 1/162 [00:01<04:50, 1.80s/update]
```

SWivid (Owner) commented Jan 27, 2025

Hi @P-uck, I was busy last week; have you got it working now?
Just had some fixes merged: #729, #741.

P-uck (Author) commented Jan 28, 2025

@SWivid

Hi,
Not yet. I tried different ways, but I think there is a hidden problem with DDP; please check #728.

I put the logs there.

thanks

SWivid (Owner) commented Jan 29, 2025

Hi @P-uck, feel free to check whether the latest pull works~

P-uck (Author) commented Jan 29, 2025

Hi @SWivid

Thanks for the update. I just tested it with 2×4090 and 4×3090, and it's working perfectly.

After around 2500 epochs with an ~8-hour, clean, single-speaker Arabic dataset, the performance is good, but I have two issues:

1. At the beginning of the generated speech, I can hear a faint voice that sounds like the end of the reference text and audio. Why does this happen, and how can I remove it?

2. When generating longer sentences (over 25–30 seconds), the quality drops significantly after ~25–30 seconds, with missing words and overall degradation. Do you have any suggestions to improve this?

SWivid (Owner) commented Jan 29, 2025

Hi @P-uck, this is expected with log_samples during training (it doesn't affect actual inference).
See #719.

During actual inference, a preprocessing step adds a proper silence clip to the tail of the reference audio, which avoids prompt leakage to a certain extent. The ref_audio is also clipped to at most 15 seconds, and the total duration (ref_audio + gen_audio) is kept under 30 seconds, which is the maximum length our pretrained model saw during training.
The log_samples output is just a simple duplicate test during training, so that maximum length is likely exceeded there.
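
A small sketch of that duration budget using the caps stated above (the constants mirror the numbers in this comment; the helper function itself is illustrative, not from the repo):

```python
MAX_REF_SECONDS = 15.0    # reference audio is clipped to at most this
MAX_TOTAL_SECONDS = 30.0  # ref_audio + gen_audio should stay under this

def gen_budget_seconds(ref_seconds: float) -> float:
    """Seconds of generated audio that fit within the model's trained
    maximum, given a reference clip of ref_seconds."""
    return MAX_TOTAL_SECONDS - min(ref_seconds, MAX_REF_SECONDS)

print(gen_budget_seconds(12.0))  # -> 18.0 seconds of generation headroom
```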

P-uck (Author) commented Jan 29, 2025

@SWivid

Thanks, it's all clear to me now.

HT5757 commented Feb 8, 2025

I assume that you used the base model for training, so your finished model should be able to speak English and Arabic, right? Would it be possible to use both languages in one sentence?
For example: "Hello my name is إبراهيم"

Would it recognize that it is now a different language and then pronounce the word in Arabic?

If not, how would it be possible?

My goal is actually to change the pronunciation of some English words that are of Arabic origin.

spygaurad pushed a commit to spygaurad/F5-TTS that referenced this issue Feb 28, 2025
SNAKEIX commented Mar 19, 2025

> @SWivid
> Thanks, it's all clear to me now.

Hey man, it seems you finished the Arabic model. Can you share it with us, please?

6 participants