Kokoro with all supported languages and voices + Orpheus added to API and UI #58
Conversation
This is great! I was thinking about the same but for all models, because Orpheus has several voices as well.
It's a great idea! Adding the Orpheus model and voices right now 🚀
Done and ready for review @Blaizzy 🚀
@Blaizzy I tested all Orpheus voices one by one, and some of them are not working. Tara, Zac, and Zoe produce long audio with empty sections or prolonged audio, even when generating from the command line. Give them a try.
Hey Ivan, yes, you are right! I noticed the same. I would remove those voices for now, add some comments, and we can revisit them later.
We can try to add back all voices after #68 |
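The suggestion above could look something like this: a voice list with the problematic voices commented out rather than deleted, so they are easy to re-enable later. This is a hypothetical sketch; the variable name and the exact list structure in the codebase may differ (the voice names themselves come from the thread and the Orpheus model card).

```python
# Hypothetical sketch: Orpheus voice list with the broken voices
# commented out pending a fix, as suggested in the review thread.
ORPHEUS_VOICES = [
    "leah",
    "jess",
    "leo",
    "dan",
    "mia",
    # Temporarily disabled: these voices produce long audio with empty
    # or prolonged segments. Try re-enabling after #68.
    # "tara",
    # "zac",
    # "zoe",
]
```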
Closed by mistake, working on it. |
Force-pushed from 2b45fde to ece3ca6
No worries, let me know when you're ready :)
Ok @Blaizzy, ready to go. Orpheus was fixed at 15 seconds of audio; I changed the logic to be able to split text in multiple ways.
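The splitting logic mentioned here could be sketched as follows. This is an illustrative version only, assuming a sentence-boundary split followed by greedy packing up to a character budget; the PR's actual implementation (`_split_text_into_chunks`) may differ.

```python
import re

def split_text(text: str, max_chars: int = 300) -> list[str]:
    """Split text on sentence boundaries, then greedily pack sentences
    into chunks no longer than max_chars (sketch, not the PR's code)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk if adding this sentence would exceed the budget.
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

A single sentence longer than `max_chars` passes through untouched here; a production version would need a fallback split (e.g. on commas or whitespace).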
@Blaizzy ready!
"mlx-community/Kokoro-82M-6bit",
"mlx-community/Kokoro-82M-8bit",
"mlx-community/Kokoro-82M-bf16",
"mlx-community/orpheus-3b-0.1-ft-bf16",
"mlx-community/orpheus-3b-0.1-ft-8bit",
"mlx-community/orpheus-3b-0.1-ft-6bit",
"mlx-community/orpheus-3b-0.1-ft-4bit",
How about we use something like this:
from huggingface_hub import HfApi

# Initialize the API
hf_api = HfApi()

# Search for models from a specific organization
models = hf_api.list_models(
    author="mlx-community"  # Replace with the actual organization name
)

# Print the results
for model in models:
    print(model.id, model.downloads)
Output:
mlx-community/csm-1b
mlx-community/Qwen2.5-VL-32B-Instruct-bf16
...
What do you think?
Note: The user might need to export the HF Access token for this.
But we'll bring in many models, including non-audio ones. Maybe we can think about something like this later, when we support more models. No?
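One way to address this concern would be to filter the listing client-side by model-family name, so only the audio models this project supports come through. This is a sketch, not the project's code; the family names and function names are illustrative.

```python
def is_audio_model(model_id: str) -> bool:
    """Keep only model ids that mention a supported audio model family
    (family list is illustrative, not exhaustive)."""
    audio_families = ("Kokoro", "orpheus", "csm")
    return any(family in model_id for family in audio_families)

def list_audio_models() -> list[str]:
    # Requires network access; an HF access token may need to be exported.
    from huggingface_hub import HfApi
    return [
        m.id
        for m in HfApi().list_models(author="mlx-community")
        if is_audio_model(m.id)
    ]
```

Name-based filtering avoids depending on repo tags being set consistently, at the cost of needing the family list updated as new model types are supported.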
available_models = [
    {"id": "mlx-community/Kokoro-82M-4bit", "name": "Kokoro 82M 4bit"},
    {"id": "mlx-community/Kokoro-82M-6bit", "name": "Kokoro 82M 6bit"},
    {"id": "mlx-community/Kokoro-82M-8bit", "name": "Kokoro 82M 8bit"},
    {"id": "mlx-community/Kokoro-82M-bf16", "name": "Kokoro 82M bf16"},
    {"id": "mlx-community/orpheus-3b-0.1-ft-bf16", "name": "Orpheus 3B bf16"},
    {"id": "mlx-community/orpheus-3b-0.1-ft-8bit", "name": "Orpheus 3B 8bit"},
    {"id": "mlx-community/orpheus-3b-0.1-ft-6bit", "name": "Orpheus 3B 6bit"},
    {"id": "mlx-community/orpheus-3b-0.1-ft-4bit", "name": "Orpheus 3B 4bit"},
]
We also support sesame now :)
True! Let's start closing this one out to support all languages for Kokoro and Orpheus (which is amazing).
We'll create a separate PR for sesame.
parser.add_argument(
    "--max_audio_length",
    type=float,
    default=90.0,
    help="Maximum audio length per segment in seconds",
)
Why?
This is the max length of a single chunk; we could hardcode 90 seconds, which was the default in the codebase. If we leave this as a flag, it can be used to create smaller chunks on systems with low memory.
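For context, a per-segment audio limit like this typically gets translated into a generation token budget. The sketch below assumes a fixed audio-codec token rate; the constant here is purely illustrative, since the real value depends on the codec the model uses.

```python
def max_tokens_for_audio(max_audio_seconds: float,
                         tokens_per_second: int = 86) -> int:
    """Convert a max audio length in seconds into a token budget for
    generation. tokens_per_second is an assumed, illustrative codec
    rate, not a value taken from this codebase."""
    return int(max_audio_seconds * tokens_per_second)
```

With the default of 90.0 seconds, generation would be capped at `max_tokens_for_audio(90.0)` tokens; lowering `--max_audio_length` shrinks each chunk's memory footprint proportionally.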
mlx_audio/tts/models/llama/llama.py
Outdated
# Process each chunk separately
for chunk_idx, input_ids in enumerate(all_modified_input_ids):
    sampler = make_sampler(temperature, top_p, top_k=kwargs.get("top_k", -1))
    logits_processors = make_logits_processors(
Same here.
mlx_audio/tts/models/llama/llama.py
Outdated
    peak_memory_usage=mx.metal.get_peak_memory() / 1e9,
)
if len(all_prompts) != len(my_samples):
    # If there's a mismatch, just provide what we have
This should throw an error.
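The reviewer's suggestion here is to fail loudly on a count mismatch instead of silently returning partial results. A minimal sketch of that change (function and variable names are illustrative, taken from the diff above):

```python
def pair_prompts_with_samples(all_prompts: list, my_samples: list) -> list[tuple]:
    """Raise on a prompt/sample count mismatch rather than silently
    truncating, as suggested in the review."""
    if len(all_prompts) != len(my_samples):
        raise ValueError(
            f"Prompt/sample count mismatch: {len(all_prompts)} prompts "
            f"but {len(my_samples)} generated samples"
        )
    return list(zip(all_prompts, my_samples))
```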
mlx_audio/tts/models/llama/llama.py
Outdated
# Further split long prompts into smaller chunks
all_prompts = []
for p in prompts:
    if len(p) > 300:  # Only split if text is longer than 300 chars
        chunks = self._split_text_into_chunks(p)
        all_prompts.extend(chunks)
    else:
        all_prompts.append(p)

prompts = [f"{voice}: " + p for p in all_prompts if p.strip()]
On second thought, I think we should revert most of these changes except this part, because the downstream code already handles a list of prompts.
@lucasnewman could you please check the sesame changes and see if anything stands out? I noticed that generate doesn't process a list of prompts like Kokoro (pipeline) and Orpheus do. Initially I thought of enforcing all models to use a
Looks fine to me apart from your comments.
Yeah, I personally prefer the simplest approach and lighter abstraction. I think it's reasonable to have every generate() implementation take either a string or list of strings though, since sentence splitting is so common / useful.
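The str-or-list convention suggested here can be handled with a small normalization step at the top of each generate() implementation. A sketch under that assumption (the helper name is illustrative):

```python
from typing import Union

def normalize_prompts(text: Union[str, list]) -> list[str]:
    """Let generate() accept either a single string or a list of
    strings by normalizing to a list up front."""
    if isinstance(text, str):
        return [text]
    return list(text)
```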
A /voices API was added to get the list of Kokoro voices and filter them by language for the frontend.
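The language filtering described here can key off the Kokoro voice-id convention, where the first letter encodes the language (e.g. `a` = American English, `b` = British English, `j` = Japanese) and the second the gender. A sketch of the filtering logic only, with a partial example voice list, not the server endpoint itself:

```python
# Partial example list of real Kokoro voice ids; the full list is larger.
KOKORO_VOICES = ["af_heart", "af_bella", "am_adam", "bf_emma", "bm_george", "jf_alpha"]

def voices_for_language(lang_prefix: str) -> list[str]:
    """Filter voices by their language prefix, e.g. "a" for American
    English, preserving list order."""
    return [v for v in KOKORO_VOICES if v.startswith(lang_prefix)]
```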
Closes #29 and #30