Does it allow detecting word or sentence boundaries during TTS generation? #887

Open
4 tasks done
securealex opened this issue Mar 20, 2025 · 3 comments
Labels
help wanted Extra attention is needed

Comments

@securealex

Checks

  • This template is only for usage issues encountered.
  • I have thoroughly reviewed the project documentation but couldn't find information to solve my problem.
  • I have searched for existing issues, including closed ones, and couldn't find a solution.
  • I am using English to submit this issue to facilitate community communication.

Environment Details

Ubuntu 22, Python 3.11.

Dear All,

Is there any way to trigger events on word or sentence boundaries during generation? Or is it possible to detect the boundaries from the result?

Thanks

Steps to Reproduce

Call the API with Python to generate speech for a given text.
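
For context, a minimal sketch of the kind of call involved, assuming the `F5TTS` class in `f5_tts.api` and its `infer` method; the argument names below are assumptions and may not match the actual API exactly:

```python
# Minimal sketch of the reported usage; class and argument names are
# assumptions and may differ from the actual f5_tts API.
from f5_tts.api import F5TTS

tts = F5TTS()  # load the default model

# infer() is assumed to return only raw audio (plus sample rate),
# with no word- or sentence-level timing metadata attached.
wav, sr, _ = tts.infer(
    ref_file="ref_audio.wav",          # reference voice clip (illustrative path)
    ref_text="Transcript of the reference clip.",
    gen_text="Hello world, this is a test. And a second sentence.",
    file_wave="output.wav",            # optionally write the result to disk
)
```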

✔️ Expected Behavior

It would trigger events at word or sentence boundaries for the caller to handle, or produce boundary metadata along with the TTS result.
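
For illustration only, a hypothetical sketch of what such boundary metadata or a callback could look like; none of these names exist in the project today:

```python
# Hypothetical sketch of the requested behavior; nothing here is an
# existing F5-TTS interface.
from dataclasses import dataclass

@dataclass
class Boundary:
    kind: str       # "word" or "sentence"
    text: str       # the word or sentence just produced
    start_s: float  # start time within the generated audio, in seconds
    end_s: float    # end time within the generated audio, in seconds

def on_boundary(b: Boundary) -> None:
    print(f"{b.kind} boundary: {b.text!r} [{b.start_s:.2f}s - {b.end_s:.2f}s]")

# Desired usage (imaginary API):
#   wav, sr, boundaries = tts.infer(..., return_boundaries=True)
#   # or
#   tts.infer(..., boundary_callback=on_boundary)
```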

❌ Actual Behavior

It produces a raw byte array of audio without any metadata or events.

securealex added the help wanted (Extra attention is needed) label on Mar 20, 2025
@SWivid
Owner

SWivid commented Mar 20, 2025

Can you provide a screenshot of the CLI when you run inference? I think the gen_texts are already shown (you could check the corresponding code in utils_infer.py).

@securealex
Author

securealex commented Mar 20, 2025

The current inference is based on pre-segmented text. It can split the input text at the sentence level for processing, but this approach may not be suitable for word-level splitting, since individual words might be too short for efficient TTS generation. Moreover, inference performance suffers after segmentation: the same text takes longer to synthesize once it is split. It would therefore be ideal if the inference process itself could align the results with words or sentences, similar to the standard Web Speech API in browsers, which supports event callbacks at both the word and sentence level.
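
As a workaround under the current pre-segmentation approach, sentence boundaries can be approximated by synthesizing each sentence separately and accumulating audio offsets, at the cost of the per-segment slowdown mentioned above. A rough sketch, with the `synthesize` helper standing in for the real inference call:

```python
# Rough workaround sketch: approximate sentence boundaries by tracking
# the audio offset of each separately synthesized segment.
import re

def synthesize(sentence: str) -> tuple[list[float], int]:
    """Placeholder for the real TTS call; returns (samples, sample_rate)."""
    raise NotImplementedError  # substitute the actual inference here

def tts_with_sentence_boundaries(text: str):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    audio: list[float] = []
    boundaries = []  # list of (sentence, start_s, end_s)
    sample_rate = 0
    for sentence in sentences:
        samples, sample_rate = synthesize(sentence)
        start_s = len(audio) / sample_rate
        audio.extend(samples)
        end_s = len(audio) / sample_rate
        boundaries.append((sentence, start_s, end_s))
    return audio, sample_rate, boundaries
```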

@SWivid
Copy link
Owner

SWivid commented Mar 20, 2025

So you want explicitly aligned synthesis? You could refer to Voicebox or Matcha-TTS, for example, if you prefer that approach.
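
For models that predict explicit per-token durations (the kind of approach used by Matcha-TTS and Voicebox), word boundaries can be recovered by accumulating those durations. A generic sketch, not tied to any particular codebase, with illustrative frame-rate parameters:

```python
# Generic sketch: convert per-token durations (in mel frames) from an
# explicit-duration TTS model into word boundary timestamps.
# hop_length and sample_rate below are illustrative defaults.

def word_boundaries(tokens, durations_frames, hop_length=256, sample_rate=24000):
    """tokens: character tokens aligned with durations; ' ' marks word breaks."""
    sec_per_frame = hop_length / sample_rate
    boundaries = []  # (word, start_s, end_s)
    word, word_start, t = "", 0.0, 0.0
    for tok, dur in zip(tokens, durations_frames):
        if tok == " ":
            if word:
                boundaries.append((word, word_start, t))
            word, word_start = "", t + dur * sec_per_frame
        else:
            word += tok
        t += dur * sec_per_frame
    if word:
        boundaries.append((word, word_start, t))
    return boundaries

# Example: characters of "hi yo" with per-character frame counts
print(word_boundaries(list("hi yo"), [10, 12, 4, 9, 11]))
```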
