Does it allow detecting word or sentence boundaries during TTS generation? #887

Open
4 tasks done
securealex opened this issue Mar 20, 2025 · 3 comments
Labels
help wanted Extra attention is needed

Comments

@securealex

Checks

  • This template is only for usage issues encountered.
  • I have thoroughly reviewed the project documentation but couldn't find information to solve my problem.
  • I have searched for existing issues, including closed ones, and couldn't find a solution.
  • I am using English to submit this issue to facilitate community communication.

Environment Details

Ubuntu 22, Python 3.11.

Dear All,

Is there any way to trigger events on word or sentence boundaries during generation? Or is it possible to detect the boundaries from the result?

Thanks

Steps to Reproduce

Call the API with Python to generate speech for a given text.
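
For context, a minimal sketch of the kind of call involved, assuming the `F5TTS` class in `f5_tts.api` and its `infer` method; the argument names below are assumptions and may not match the actual API exactly:

```python
# Minimal sketch of the reported usage; class and argument names are
# assumptions and may differ from the actual f5_tts API.
from f5_tts.api import F5TTS

tts = F5TTS()  # load the default model

# infer() is assumed to return only raw audio (plus sample rate),
# with no word- or sentence-level timing metadata attached.
wav, sr, _ = tts.infer(
    ref_file="ref_audio.wav",          # reference voice clip (illustrative path)
    ref_text="Transcript of the reference clip.",
    gen_text="Hello world, this is a test. And a second sentence.",
    file_wave="output.wav",            # optionally write the result to disk
)
```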

✔️ Expected Behavior

It would trigger events at word or sentence boundaries for the caller to handle, or produce boundary metadata along with the TTS result.
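
For illustration only, a hypothetical sketch of what such boundary metadata or a callback could look like; none of these names exist in the project today:

```python
# Hypothetical sketch of the requested behavior; nothing here is an
# existing F5-TTS interface.
from dataclasses import dataclass

@dataclass
class Boundary:
    kind: str       # "word" or "sentence"
    text: str       # the word or sentence just produced
    start_s: float  # start time within the generated audio, in seconds
    end_s: float    # end time within the generated audio, in seconds

def on_boundary(b: Boundary) -> None:
    print(f"{b.kind} boundary: {b.text!r} [{b.start_s:.2f}s - {b.end_s:.2f}s]")

# Desired usage (imaginary API):
#   wav, sr, boundaries = tts.infer(..., return_boundaries=True)
#   # or
#   tts.infer(..., boundary_callback=on_boundary)
```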

❌ Actual Behavior

It produces a raw byte array of audio without any metadata or events.

securealex added the help wanted (Extra attention is needed) label on Mar 20, 2025
@SWivid
Owner

SWivid commented Mar 20, 2025

Can you provide a screenshot of the CLI when you run inference? I think the gen_texts are already shown (you could check the corresponding code in utils_infer.py).

@securealex
Author

securealex commented Mar 20, 2025

The current inference is based on pre-segmented text. It can split the input text at the sentence level for processing, but this approach may not be suitable for word-level splitting, since individual words might be too short for efficient TTS generation. Moreover, inference performance suffers after segmentation: the same text takes longer to synthesize once it is split. It would therefore be ideal if the inference process itself could align the results with words or sentences, similar to the standard Web Speech API in browsers, which supports event callbacks at both the word and sentence level.
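
As a workaround under the current pre-segmentation approach, sentence boundaries can be approximated by synthesizing each sentence separately and accumulating audio offsets, at the cost of the per-segment slowdown mentioned above. A rough sketch, with the `synthesize` helper standing in for the real inference call:

```python
# Rough workaround sketch: approximate sentence boundaries by tracking
# the audio offset of each separately synthesized segment.
import re

def synthesize(sentence: str) -> tuple[list[float], int]:
    """Placeholder for the real TTS call; returns (samples, sample_rate)."""
    raise NotImplementedError  # substitute the actual inference here

def tts_with_sentence_boundaries(text: str):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    audio: list[float] = []
    boundaries = []  # list of (sentence, start_s, end_s)
    sample_rate = 0
    for sentence in sentences:
        samples, sample_rate = synthesize(sentence)
        start_s = len(audio) / sample_rate
        audio.extend(samples)
        end_s = len(audio) / sample_rate
        boundaries.append((sentence, start_s, end_s))
    return audio, sample_rate, boundaries
```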

@SWivid
Copy link
Owner

SWivid commented Mar 20, 2025

So you want explicitly aligned synthesis? You could refer to Voicebox or Matcha-TTS, for example, if you prefer that approach.
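
For models that predict explicit per-token durations (the kind of approach used by Matcha-TTS and Voicebox), word boundaries can be recovered by accumulating those durations. A generic sketch, not tied to any particular codebase, with illustrative frame-rate parameters:

```python
# Generic sketch: convert per-token durations (in mel frames) from an
# explicit-duration TTS model into word boundary timestamps.
# hop_length and sample_rate below are illustrative defaults.

def word_boundaries(tokens, durations_frames, hop_length=256, sample_rate=24000):
    """tokens: character tokens aligned with durations; ' ' marks word breaks."""
    sec_per_frame = hop_length / sample_rate
    boundaries = []  # (word, start_s, end_s)
    word, word_start, t = "", 0.0, 0.0
    for tok, dur in zip(tokens, durations_frames):
        if tok == " ":
            if word:
                boundaries.append((word, word_start, t))
            word, word_start = "", t + dur * sec_per_frame
        else:
            word += tok
        t += dur * sec_per_frame
    if word:
        boundaries.append((word, word_start, t))
    return boundaries

# Example: characters of "hi yo" with per-character frame counts
print(word_boundaries(list("hi yo"), [10, 12, 4, 9, 11]))
```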
