Checks
- This template is only for usage issues encountered.
- I have thoroughly reviewed the project documentation but couldn't find information to solve my problem.
- I have searched for existing issues, including closed ones, and couldn't find a solution.
- I confirm that I am using English to submit this report in order to facilitate communication.
Environment Details
python=3.10
Steps to Reproduce
Hi there, thanks for your amazing work on open-sourcing F5-TTS! While training F5 on my own dataset, I noticed an issue: the generated audio quality drops significantly when the dataset contains some audio clips longer than 20 seconds.
I saw that src/f5_tts/train/finetune_gradio.py has a slice function that splits audio at silent segments. So my question is: for longer audio clips (e.g., over 20 seconds), do I need to slice them into shorter segments, transcribe them, and then train on the segments? Or is it fine to include a small number of long clips directly? I plan to use a much larger dataset later, and slicing and transcribing every long audio file might be too much work.
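To make the question concrete, here is a minimal sketch of the kind of silence-based slicing I mean. It is not the slice function from finetune_gradio.py. The names `frame_rms` and `split_long_clip`, the 20 ms frame size, and the 0.01 RMS threshold are all assumptions chosen for illustration. It cuts at the last silent frame before the 20-second limit and falls back to a hard cut when no silence is found.

```python
import math


def frame_rms(samples, frame_len):
    """Return the RMS energy of each non-overlapping frame."""
    rms = []
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        rms.append(math.sqrt(sum(x * x for x in frame) / len(frame)))
    return rms


def split_long_clip(samples, sample_rate, max_seconds=20.0,
                    frame_ms=20, silence_threshold=0.01):
    """Return (start, end) sample ranges, each at most max_seconds long.

    Hypothetical helper: the threshold and frame size are illustrative
    assumptions, not values taken from F5-TTS.
    """
    max_len = int(max_seconds * sample_rate)
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    rms = frame_rms(samples, frame_len)
    # Sample index at the middle of every frame judged to be silent.
    silent_points = [i * frame_len + frame_len // 2
                     for i, energy in enumerate(rms)
                     if energy < silence_threshold]

    segments = []
    start, n = 0, len(samples)
    while n - start > max_len:
        limit = start + max_len
        # Prefer the latest silent point inside the allowed window.
        candidates = [p for p in silent_points if start < p <= limit]
        cut = candidates[-1] if candidates else limit  # hard cut if no silence
        segments.append((start, cut))
        start = cut
    if start < n:
        segments.append((start, n))
    return segments


# Synthetic 50 s clip at 1 kHz: tone, one second of silence at 18 s, tone.
sr = 1000
clip = [0.0 if 18 * sr <= i < 19 * sr else 0.5 * math.sin(2 * math.pi * i / 20)
        for i in range(50 * sr)]
segments = split_long_clip(clip, sr)
for seg_start, seg_end in segments:
    print(f"{seg_start / sr:.2f}s -> {seg_end / sr:.2f}s")
```

Each resulting segment would still need its own transcript, which is the part I am worried about at scale. A real pipeline would probably also enforce a minimum segment length, which this sketch does not.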
✔️ Expected Behavior
No response
❌ Actual Behavior
No response