Kokoro with all supported languages and voices + Orpheus added to API and UI #58

Open · wants to merge 2 commits into main

Conversation

ivanfioravanti
Collaborator

/voices API added to get list of Kokoro voices and filter them by language for the frontend.

Closes #29 and #30
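
A minimal sketch of the filtering a /voices endpoint like this can do, assuming Kokoro's convention that a voice name's prefix encodes its language (e.g. `af_` for American English, female); the voice list and function name here are illustrative, not the actual implementation:

```python
# Sketch of the filtering behind a /voices endpoint. The language-prefix
# convention follows Kokoro's voice naming; the list below is
# illustrative, not exhaustive.
KOKORO_VOICES = ["af_heart", "af_bella", "am_adam", "bf_emma", "bm_george"]

def list_voices(lang_prefix=None):
    """Return all voices, or only those whose name starts with lang_prefix."""
    if lang_prefix is None:
        return KOKORO_VOICES
    return [v for v in KOKORO_VOICES if v.startswith(lang_prefix)]

print(list_voices("af"))  # ['af_heart', 'af_bella']
```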

@ivanfioravanti ivanfioravanti requested a review from Blaizzy March 23, 2025 21:04
@Blaizzy
Owner

Blaizzy commented Mar 23, 2025

This is great!

I was thinking about the same but for all models.

Because Orpheus has several voices as well.

@ivanfioravanti
Collaborator Author

It's a great idea! Adding Orpheus model and voices right now 🚀

@ivanfioravanti
Collaborator Author

Done and ready for review @Blaizzy 🚀

@ivanfioravanti ivanfioravanti changed the title chore: update dependencies and enhance TTS Kokoro with all supported languages Kokoro with all supported languages and voices + Orpheus added to API and UI Mar 23, 2025
@ivanfioravanti
Collaborator Author

@Blaizzy I tested all Orpheus voices one by one; some of them are not working. Tara, Zac, and Zoe produce long audio with empty or prolonged segments, even when generating from the command line. Give them a try.

@Blaizzy
Owner

Blaizzy commented Mar 26, 2025

Hey Ivan

Yes, you are right! I noticed the same.

I would remove those voices for now. Add some comments and we can revisit them later.

@ivanfioravanti
Collaborator Author

We can try to add back all voices after #68

@ivanfioravanti
Collaborator Author

Closed by mistake, working on it.

@ivanfioravanti ivanfioravanti force-pushed the main branch 2 times, most recently from 2b45fde to ece3ca6 Compare March 29, 2025 16:05
@Blaizzy
Owner

Blaizzy commented Mar 29, 2025

No worries, let me know when you're ready :)

@ivanfioravanti
Collaborator Author

Ok @Blaizzy, ready to go. Orpheus was previously capped at 15 seconds of audio; I changed the logic so the text can be split in multiple ways.
Everything seems good to me:

  • All voices and languages added for Orpheus
  • Longer audio generation in Orpheus

@ivanfioravanti
Collaborator Author

@Blaizzy ready!

Comment on lines 93 to +99
"mlx-community/Kokoro-82M-6bit",
"mlx-community/Kokoro-82M-8bit",
"mlx-community/Kokoro-82M-bf16",
"mlx-community/orpheus-3b-0.1-ft-bf16",
"mlx-community/orpheus-3b-0.1-ft-8bit",
"mlx-community/orpheus-3b-0.1-ft-6bit",
"mlx-community/orpheus-3b-0.1-ft-4bit",
Owner

How about we use something like this:

from huggingface_hub import HfApi

# Initialize the API
hf_api = HfApi()

# Search for models from a specific organization
models = hf_api.list_models(
    author="mlx-community"  # organization whose models to list
)

# Print the results
for model in models:
    print(model.id, model.downloads)

Output:

mlx-community/csm-1b
mlx-community/Qwen2.5-VL-32B-Instruct-bf16
...

What do you think?

Note: The user might need to export the HF Access token for this.

Collaborator Author

But that would bring in many models, including non-audio ones. Maybe we can think about something along these lines later, when we support more models. No?
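
One way to avoid pulling in non-audio models would be to filter the listing by the TTS families this server supports. A sketch under that assumption (the substring match is illustrative; in practice the ids would come from `HfApi().list_models(author="mlx-community")`):

```python
# Hypothetical filter: keep only checkpoints whose ids mention a
# supported TTS family (substring match is an assumption, not an API).
SUPPORTED_FAMILIES = ("kokoro", "orpheus")

def filter_tts_models(model_ids):
    """Keep only ids that mention a supported TTS family."""
    return [mid for mid in model_ids
            if any(f in mid.lower() for f in SUPPORTED_FAMILIES)]

# Example against a static list:
ids = [
    "mlx-community/Kokoro-82M-4bit",
    "mlx-community/orpheus-3b-0.1-ft-8bit",
    "mlx-community/Qwen2.5-VL-32B-Instruct-bf16",
]
print(filter_tts_models(ids))
```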

Comment on lines +427 to +437
available_models = [
{"id": "mlx-community/Kokoro-82M-4bit", "name": "Kokoro 82M 4bit"},
{"id": "mlx-community/Kokoro-82M-6bit", "name": "Kokoro 82M 6bit"},
{"id": "mlx-community/Kokoro-82M-8bit", "name": "Kokoro 82M 8bit"},
{"id": "mlx-community/Kokoro-82M-bf16", "name": "Kokoro 82M bf16"},
{"id": "mlx-community/orpheus-3b-0.1-ft-bf16", "name": "Orpheus 3B bf16"},
{"id": "mlx-community/orpheus-3b-0.1-ft-8bit", "name": "Orpheus 3B 8bit"},
{"id": "mlx-community/orpheus-3b-0.1-ft-6bit", "name": "Orpheus 3B 6bit"},
{"id": "mlx-community/orpheus-3b-0.1-ft-4bit", "name": "Orpheus 3B 4bit"},
]

Owner

We also support sesame now :)

Collaborator Author

True! Let's focus on closing this to support all languages for Kokoro and Orpheus (which is amazing).
We'll create a separate PR for Sesame.

Comment on lines +220 to +225
parser.add_argument(
    "--max_audio_length",
    type=float,
    default=90.0,
    help="Maximum audio length per segment in seconds",
)
Owner

Why?

Collaborator Author

This is the max length of a single chunk; we could hardcode 90 seconds, which was the default in the codebase. If we keep the flag, it can be used to create smaller chunks on systems with low memory.
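
To illustrate how such a budget can bound chunk size, here is a hedged sketch assuming a rough average speaking rate (the rate and function names are illustrative, not from the codebase):

```python
# Sketch: bound chunk size by an audio-length budget, assuming an
# average TTS speaking rate (the 2.5 words/sec figure is illustrative).
WORDS_PER_SECOND = 2.5

def max_words(max_audio_length=90.0):
    """Largest chunk (in words) expected to fit in the audio budget."""
    return int(max_audio_length * WORDS_PER_SECOND)

def chunk_by_budget(text, max_audio_length=90.0):
    """Split text into word chunks that should each fit the budget."""
    words = text.split()
    n = max_words(max_audio_length)
    return [" ".join(words[i:i + n]) for i in range(0, len(words), n)]

print(max_words(12.0))  # 30 words fit in ~12 s at the assumed rate
```

A smaller `--max_audio_length` would then directly yield shorter chunks, which is what makes the flag useful on low-memory systems.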

# Process each chunk separately
for chunk_idx, input_ids in enumerate(all_modified_input_ids):
    sampler = make_sampler(temperature, top_p, top_k=kwargs.get("top_k", -1))
    logits_processors = make_logits_processors(
Owner

Same here.

    peak_memory_usage=mx.metal.get_peak_memory() / 1e9,
)
if len(all_prompts) != len(my_samples):
    # If there's a mismatch, just provide what we have
Owner

This should throw an error.
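
A minimal sketch of the suggested check (the names `all_prompts` and `my_samples` follow the snippet above; the exception type and message are assumptions):

```python
def check_lengths(all_prompts, my_samples):
    """Fail loudly instead of silently returning a partial result."""
    if len(all_prompts) != len(my_samples):
        raise ValueError(
            f"Prompt/sample count mismatch: "
            f"{len(all_prompts)} prompts vs {len(my_samples)} samples"
        )

check_lengths(["hi"], [b"audio"])  # matching lengths: no error
```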

Comment on lines 344 to 353
# Further split long prompts into smaller chunks
all_prompts = []
for p in prompts:
    if len(p) > 300:  # Only split if text is longer than 300 chars
        chunks = self._split_text_into_chunks(p)
        all_prompts.extend(chunks)
    else:
        all_prompts.append(p)

prompts = [f"{voice}: " + p for p in all_prompts if p.strip()]
Owner

On second thought, I think we should revert most of these changes except this part, because the downstream code already handles a list of prompts.
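
For context, a hedged sketch of what a helper like `_split_text_into_chunks` might do; the actual implementation isn't shown in this hunk, and splitting on sentence boundaries is an assumption:

```python
import re

def split_text_into_chunks(text, max_chars=300):
    """Split on sentence boundaries, packing sentences up to max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + 1 + len(s) > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks

print(split_text_into_chunks("First sentence. Second one!", max_chars=15))
```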

@Blaizzy
Owner

Blaizzy commented Mar 29, 2025

@lucasnewman could you please check the sesame changes and see if anything stands out?

I noticed that generate doesn't process a list of prompts the way Kokoro (via its pipeline) and Orpheus do.

Initially I thought of requiring all models to use a pipeline to handle lists of inputs, but for Orpheus I kept the idea inside generate because, since it's an LLM, the pipeline would have been just a few lines of code.

@lucasnewman
Collaborator

> @lucasnewman could you please check the sesame changes and see if anything stands out?

Looks fine to me apart from your comments.

> I noticed that generate doesn't process a list of prompts the way Kokoro (via its pipeline) and Orpheus do.
>
> Initially I thought of requiring all models to use a pipeline to handle lists of inputs, but for Orpheus I kept the idea inside generate because, since it's an LLM, the pipeline would have been just a few lines of code.

Yeah, I personally prefer the simplest approach and lighter abstraction. I think it's reasonable to have every generate() implementation take either a string or a list of strings, though, since sentence splitting is so common and useful.
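
That suggestion can be sketched as a small normalization step at the top of generate(); the function shape and names here are illustrative, not the library's actual API:

```python
from typing import List, Union

def normalize_prompts(text: Union[str, List[str]]) -> List[str]:
    """Accept a single string or a list of strings; always return a list."""
    if isinstance(text, str):
        return [text]
    return [t for t in text if t.strip()]  # drop empty entries

print(normalize_prompts("Hello world."))        # ['Hello world.']
print(normalize_prompts(["One.", "Two.", ""]))  # ['One.', 'Two.']
```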

Successfully merging this pull request may close these issues.

[help]how to change the default language in webui