TTS and STS Models to port to MLX-Audio (Roadmap) #1

Open
3 of 12 tasks
Blaizzy opened this issue Feb 28, 2025 · 6 comments
Labels
good first issue Good for newcomers

Comments

@Blaizzy
Owner

Blaizzy commented Feb 28, 2025

Overview

This issue outlines our roadmap for integrating additional text-to-speech (TTS) and speech-to-speech (STS) models into the MLX-Audio library to expand our offerings beyond the current Kokoro model.

Text-to-Speech (TTS) Models

Planned TTS Models

  • Zonos
  • CosyVoice2
  • StyleTTS2
  • Parler TTS
  • BARK
  • ibm-granite/granite-speech-3.2-8b
  • LLMVoX
  • MeloTTS
  • Sesame
  • CSM-1B

Speech-to-Speech (STS) Models

Planned STS Models

  • Kyutai-Labs Moshi
  • Kyutai-Labs Moshi-vis

Technical Considerations

  • All models will need MLX-specific optimizations
  • Quantization support should be implemented for each model
  • Documentation and examples will be created for each new model
  • Performance benchmarks will be established
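One common way to benchmark a TTS port is the real-time factor (RTF): wall-clock seconds spent per second of audio produced, where values below 1.0 mean faster than real time. The sketch below is only an illustration and is not part of MLX-Audio; `real_time_factor`, `fake_synthesize`, and the 24 kHz sample rate are hypothetical names and values chosen for the example.

```python
import statistics
import time
from typing import Callable, Sequence


def real_time_factor(
    synthesize: Callable[[str], Sequence[float]],
    text: str,
    sample_rate: int,
    runs: int = 3,
) -> float:
    """Return the median real-time factor over `runs` calls.

    `synthesize` is a hypothetical callable that takes text and returns
    a sequence of audio samples; it stands in for any model under test.
    """
    if runs < 1:
        raise ValueError("runs must be at least 1")
    ratios = []
    for _ in range(runs):
        start = time.perf_counter()
        samples = synthesize(text)
        elapsed = time.perf_counter() - start
        audio_seconds = len(samples) / sample_rate
        if audio_seconds == 0:
            raise ValueError("synthesize returned no audio")
        ratios.append(elapsed / audio_seconds)
    # The median is less sensitive to a slow first (warm-up) run than the mean.
    return statistics.median(ratios)


def fake_synthesize(text: str) -> list:
    """Stand-in model (assumption): one second of silence at 24 kHz."""
    return [0.0] * 24_000


rtf = real_time_factor(fake_synthesize, "Hello from the roadmap.", sample_rate=24_000)
print(f"real-time factor: {rtf:.6f}")
```

When timing a real MLX model, keep in mind that MLX evaluates lazily, so the callable should force evaluation (for example with `mx.eval`) before it returns; otherwise the timer may stop before the audio has actually been computed.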

Instructions:

  1. Select the model and comment below with your selection
  2. Create a Draft PR titled: "Add support for X"
  3. Read the contribution guide
  4. Check existing models
  5. Tag @Blaizzy for code reviews and questions.

Community Input

We welcome community feedback on prioritization and additional model suggestions. Please comment on this issue with your thoughts.

@Blaizzy Blaizzy changed the title TTS and STS Models to port to MLX-Audio TTS and STS Models to port to MLX-Audio (Roadmap) Feb 28, 2025
@Blaizzy Blaizzy added the good first issue Good for newcomers label Feb 28, 2025
@dwohlfahrt

dwohlfahrt commented Feb 28, 2025

As always, thanks a MILLION for all the work you do @Blaizzy. You are a legend in the truest sense of the word 🙏

And now, of course, I have to chime in with my own selfish requests 😄

  1. For TTS, adding on a +1 for Zonos
  2. For STS, RVC would be huge, as I run kokoro outputs through it using RVC-generated fine tunes of custom voices to basically get the best of both worlds (aka kokoro with true voice cloning). Results are absolute 🔥 😃

@Blaizzy
Owner Author

Blaizzy commented Feb 28, 2025

Thanks a lot, it's my pleasure!

Yes, Zonos is on the way 🚀

Could you share this RVC + Kokoro example?

@szafranek

szafranek commented Mar 1, 2025

I found this project through your awesome demo.

Would you consider supporting StyleTTS2?

@chigkim
Contributor

chigkim commented Mar 19, 2025

This is amazing!!!
@Blaizzy when do you sleep? VLM, now Audio?
Anyways, there's Outes. It's based on text LLMs, so even llama.cpp and ExLlamaV2 can run it.
Thanks!

@chigkim
Contributor

chigkim commented Mar 19, 2025

Orpheus-TTS was just released, and it sounds really good!
https://github.com/canopyai/Orpheus-TTS

@lin72h

lin72h commented Mar 19, 2025

@chigkim the KING has already started working on it: #47


5 participants