Low performance issue/question #5
Comments
Someone mentioned that the generation speed depends on the sampling rate and number of channels of the reference audio. Try resampling your audio to 24000 Hz mono and see if that changes anything. The model works at 24 kHz mono internally anyway, so there shouldn't be any difference in quality.
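For example, something like this with torchaudio (file names are placeholders):

```python
import torchaudio

# Downmix a reference clip to mono and resample it to 24 kHz,
# the rate XTTS works at internally.
wav, sr = torchaudio.load("reference.wav")
wav = wav.mean(dim=0, keepdim=True)                    # stereo -> mono
wav = torchaudio.functional.resample(wav, sr, 24000)   # any rate -> 24 kHz
torchaudio.save("reference_24k_mono.wav", wav, 24000)
```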
I should have mentioned: I mostly tested with the included example.wav, which seems to be 22 kHz mono. Performance was poor with that too.
@kanttouchthis you're asking the TTS to do a lot of extra work it doesn't need to on every call by going through tts.tts_to_file(). Here's a short-hand reference implementation of what I have locally, which usually runs at 1-2 s per 20 s of audio on a 3090:
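Roughly this shape for the init part, following the XTTS docs rather than my exact file (paths are placeholders; exact signatures can differ between TTS releases):

```python
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Load the model once when the extension starts and keep it around.
config = XttsConfig()
config.load_json("/path/to/xtts/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="/path/to/xtts/", use_deepspeed=True)
model.cuda()

# Compute the voice-clone conditioning once per reference speaker and cache it.
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["reference_24k_mono.wav"]
)
```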
Run all of that outside the usual chat loop - it's init-level work that should run once and be stashed. Then on the actual call:
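Again as a sketch of the per-call step from the XTTS docs rather than the exact local code:

```python
import torch
import torchaudio

# Per-message synthesis: reuse the cached model, latents and embedding.
out = model.inference(
    "Text generated by the chat model goes here.",
    "en",
    gpt_cond_latent,
    speaker_embedding,
    temperature=0.7,
)
torchaudio.save("output.wav", torch.tensor(out["wav"]).unsqueeze(0), 24000)
```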
Two things that might speed up your inferencing and voice outputs:
Thanks for the tips! Turning off hardware-accelerated GPU scheduling lowered the "Real-time factor" from about 0.37-0.4 to 0.28-0.3, which is a pretty decent boost. I could also see higher GPU usage during audio rendering.
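For context, the real-time factor Coqui TTS reports is generation time divided by the duration of the generated audio, so lower is better. A rough way to measure it (the `synthesize` callable is a hypothetical stand-in for the inference call above):

```python
import time

def real_time_factor(synthesize, text, sample_rate=24000):
    """Generation time divided by audio duration; < 1.0 means faster than real time."""
    start = time.perf_counter()
    wav = synthesize(text)                     # returns a 1-D array of samples
    elapsed = time.perf_counter() - start
    return elapsed / (len(wav) / sample_rate)
```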
I have another observation, though I'm going to open a separate ticket about it as a feature request. I'll keep the explanation brief here.

I have a 12 GB card, and loading a 13B model uses 11.7 GB of VRAM, leaving only about 300 MB free. My AI text generation is nice and fast at 20 tokens a second. However, when it goes to process the audio, it's clearly swapping the TTS model into VRAM, perhaps in chunks, as you don't see any major memory changes. Processing, say, 4 lines of text with this setup can take 60 seconds.

If I load a 7B model instead, which only takes about 8.5 GB of VRAM, I have 3.5 GB free, so the TTS model can load into VRAM without issue, in one nice lump. Generating the audio output then drops to between 9 and 20 seconds, which is fantastic... though I'm now using a less powerful model.

I also tried editing script.py and changing the references to "cuda" to "cpu", which loads the TTS into system RAM and runs it on the CPU instead of the GPU. In my case that's an 8-core/16-thread CPU. Is CPU rendering faster when I'm using a 13B model and short on VRAM? Yes, just about - processing on the CPU in that situation seems maybe 10-15% faster. It is obviously NOT faster than the GPU when I'm using a 7B model and have 3.5 GB of VRAM spare.

I'm guessing you need about 1.5 GB of VRAM to fit the TTS in, maybe closer to 2 GB to do it comfortably. So using the CPU may be faster in some cases, depending on how much VRAM is left after loading your model and how fast your CPU is. You can experiment on your own system, but the short version is: if you don't have much VRAM left after loading your model, expect slower audio processing.

If you edit the text-generation-webui\extensions\text_generation_webui_xtts\script.py file to change the three occurrences of "cuda" to "cpu", you DO have to restart Text-Gen-WebUI (unloading and reloading on the session tab may work; I haven't tried it).
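Roughly, that edit boils down to picking the device once instead of hard-coding "cuda"; a sketch with illustrative names, not the extension's actual ones:

```python
import torch

def pick_tts_device(prefer_gpu: bool = True) -> str:
    """Choose where the XTTS model should live instead of hard-coding "cuda".

    CPU synthesis frees VRAM for the LLM at the cost of speed, which can still
    come out ahead when the GPU is so full that the TTS weights get swapped
    in and out of VRAM on every call.
    """
    if prefer_gpu and torch.cuda.is_available():
        return "cuda"
    return "cpu"

# e.g. model.to(pick_tts_device(prefer_gpu=False)) to force CPU synthesis
```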
Me neither, I don't get the performance reported by RandomInternetPreson. But when I use the realtimeTTS version installed in the same environment, which uses the Coqui engine, I can generate a few sentences in one second; that version doesn't record any file, though, and isn't an extension integrated into ooba. It takes me 20 s to generate 30 s of audio in ooba with XTTS.
I can't replicate your results. For my sample text, the extension took 11.7 seconds. Your code took 10.9, but that is with the custom generation parameters; without those it also took 11.7 seconds. Your speedup likely comes from the fact that you're using deepspeed.
I too have reworked your code to accommodate the suggestion, with no speed increase. However, I think I know the reason: https://tts.readthedocs.io/en/latest/models/xtts.html Check out that link - you need deepspeed enabled. I haven't done this yet, but I will today; I think I need to enable deepspeed in oobabooga to get it working. Look at ooba's repo front page, they have instructions on how to enable deepspeed: https://github.com/oobabooga/text-generation-webui#deepspeed
Unfortunately deepspeed isn't officially supported on Windows. You could probably get it running in WSL, though.
This is also on my list to try out. I tried last night but was getting errors with a WSL install, even though the deepspeed documentation says it should work in WSL. I wasn't launching with --deepspeed, though.
Also, according to the deepspeed documentation it does work on Windows, with the caveat that it only works for inference. That makes me think it might work on Windows with ooba, since prebuilt Windows wheels come installed.
I'm able to run deepspeed on Windows with Python 3.9; I failed with Python 3.10/3.11.
@kanttouchthis yep, most of the big speed difference is from deepspeed; the other, smaller chunk is likely the recomputation of the latents and embeddings when doing the "clone" each time, but that's not a huge task.
I'm seeing an example of 29s of audio rendered in ~3s, so about a 10:1 ratio on a 4090 here:
https://github.com/RandomInternetPreson/text_generation_webui_xtt_Alts/tree/main#example
But on my 4090 (+Ryzen 7600X) Win 11 system I'm seeing more like a 3:1 ratio.
Any ideas what's bottlenecking me? And is anyone else seeing worse-than-expected performance?