Kokoro with all supported languages and voices + Orpheus added to API and UI #58
Conversation
This is great! I was thinking about the same but for all models, because Orpheus has several voices as well.
It's a great idea! Adding the Orpheus model and voices right now 🚀
Done and ready for review @Blaizzy 🚀
@Blaizzy I tested all Orpheus voices one by one, and some of them are not working. Tara, Zac, and Zoe produce long audio with empty sections or prolonged audio, even when generating from the command line. Give them a try.
Hey Ivan, yes, you are right! I noticed the same. I would remove those voices for now, add some comments, and we can revisit them later.
We can try to add back all voices after #68 |
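The suggestion above could look something like this: a voice list with the problematic voices commented out rather than deleted, so they are easy to re-enable later. This is a hypothetical sketch; the variable name and the exact list structure in the codebase may differ (the voice names themselves come from the thread and the Orpheus model card).

```python
# Hypothetical sketch: Orpheus voice list with the broken voices
# commented out pending a fix, as suggested in the review thread.
ORPHEUS_VOICES = [
    "leah",
    "jess",
    "leo",
    "dan",
    "mia",
    # Temporarily disabled: these voices produce long audio with empty
    # or prolonged segments. Try re-enabling after #68.
    # "tara",
    # "zac",
    # "zoe",
]
```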
Closed by mistake, working on it. |
Force-pushed from 2b45fde to ece3ca6
No worries, let me know when you're ready :)
Ok @Blaizzy, ready to go. Orpheus was fixed at 15 seconds of audio; I changed the logic to be able to split text in multiple ways.
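The splitting logic mentioned here could be sketched as follows. This is an illustrative version only, assuming a sentence-boundary split followed by greedy packing up to a character budget; the PR's actual implementation (`_split_text_into_chunks`) may differ.

```python
import re

def split_text(text: str, max_chars: int = 300) -> list[str]:
    """Split text on sentence boundaries, then greedily pack sentences
    into chunks no longer than max_chars (sketch, not the PR's code)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk if adding this sentence would exceed the budget.
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

A single sentence longer than `max_chars` passes through untouched here; a production version would need a fallback split (e.g. on commas or whitespace).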
@Blaizzy ready!
"mlx-community/Kokoro-82M-6bit",
"mlx-community/Kokoro-82M-8bit",
"mlx-community/Kokoro-82M-bf16",
"mlx-community/orpheus-3b-0.1-ft-bf16",
"mlx-community/orpheus-3b-0.1-ft-8bit",
"mlx-community/orpheus-3b-0.1-ft-6bit",
"mlx-community/orpheus-3b-0.1-ft-4bit",
How about we use something like this:
from huggingface_hub import HfApi

# Initialize the API
hf_api = HfApi()

# Search for models from a specific organization
models = hf_api.list_models(
    author="mlx-community"  # Replace with the actual organization name
)

# Print the results
for model in models:
    print(model.id, model.downloads)
Output:
mlx-community/csm-1b
mlx-community/Qwen2.5-VL-32B-Instruct-bf16
...
What do you think?
Note: The user might need to export the HF Access token for this.
But we'll bring in many models, including non-audio ones. Maybe we can think about something like this later, when we support more models. No?
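One way to address this concern would be to filter the listing client-side by model-family name, so only the audio models this project supports come through. This is a sketch, not the project's code; the family names and function names are illustrative.

```python
def is_audio_model(model_id: str) -> bool:
    """Keep only model ids that mention a supported audio model family
    (family list is illustrative, not exhaustive)."""
    audio_families = ("Kokoro", "orpheus", "csm")
    return any(family in model_id for family in audio_families)

def list_audio_models() -> list[str]:
    # Requires network access; an HF access token may need to be exported.
    from huggingface_hub import HfApi
    return [
        m.id
        for m in HfApi().list_models(author="mlx-community")
        if is_audio_model(m.id)
    ]
```

Name-based filtering avoids depending on repo tags being set consistently, at the cost of needing the family list updated as new model types are supported.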
available_models = [
    {"id": "mlx-community/Kokoro-82M-4bit", "name": "Kokoro 82M 4bit"},
    {"id": "mlx-community/Kokoro-82M-6bit", "name": "Kokoro 82M 6bit"},
    {"id": "mlx-community/Kokoro-82M-8bit", "name": "Kokoro 82M 8bit"},
    {"id": "mlx-community/Kokoro-82M-bf16", "name": "Kokoro 82M bf16"},
    {"id": "mlx-community/orpheus-3b-0.1-ft-bf16", "name": "Orpheus 3B bf16"},
    {"id": "mlx-community/orpheus-3b-0.1-ft-8bit", "name": "Orpheus 3B 8bit"},
    {"id": "mlx-community/orpheus-3b-0.1-ft-6bit", "name": "Orpheus 3B 6bit"},
    {"id": "mlx-community/orpheus-3b-0.1-ft-4bit", "name": "Orpheus 3B 4bit"},
]
We also support sesame now :)
True! Let's start closing this one out to support all languages for Kokoro and Orpheus (which is amazing).
We'll create a separate PR for sesame.
parser.add_argument(
    "--max_audio_length",
    type=float,
    default=90.0,
    help="Maximum audio length per segment in seconds",
)
Why?
This is the max length of a single chunk; we could hardcode 90 seconds, which was the default in the codebase. If we leave this as a flag, it can be used to create smaller chunks on systems with low memory.
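For context, a per-segment audio limit like this typically gets translated into a generation token budget. The sketch below assumes a fixed audio-codec token rate; the constant here is purely illustrative, since the real value depends on the codec the model uses.

```python
def max_tokens_for_audio(max_audio_seconds: float,
                         tokens_per_second: int = 86) -> int:
    """Convert a max audio length in seconds into a token budget for
    generation. tokens_per_second is an assumed, illustrative codec
    rate, not a value taken from this codebase."""
    return int(max_audio_seconds * tokens_per_second)
```

With the default of 90.0 seconds, generation would be capped at `max_tokens_for_audio(90.0)` tokens; lowering `--max_audio_length` shrinks each chunk's memory footprint proportionally.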
mlx_audio/tts/models/llama/llama.py
Outdated
# Process each chunk separately
for chunk_idx, input_ids in enumerate(all_modified_input_ids):
    sampler = make_sampler(temperature, top_p, top_k=kwargs.get("top_k", -1))
    logits_processors = make_logits_processors(
Same here.
mlx_audio/tts/models/llama/llama.py
Outdated
    peak_memory_usage=mx.metal.get_peak_memory() / 1e9,
)
if len(all_prompts) != len(my_samples):
    # If there's a mismatch, just provide what we have
This should throw an error.
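The reviewer's suggestion here is to fail loudly on a count mismatch instead of silently returning partial results. A minimal sketch of that change (function and variable names are illustrative, taken from the diff above):

```python
def pair_prompts_with_samples(all_prompts: list, my_samples: list) -> list[tuple]:
    """Raise on a prompt/sample count mismatch rather than silently
    truncating, as suggested in the review."""
    if len(all_prompts) != len(my_samples):
        raise ValueError(
            f"Prompt/sample count mismatch: {len(all_prompts)} prompts "
            f"but {len(my_samples)} generated samples"
        )
    return list(zip(all_prompts, my_samples))
```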
mlx_audio/tts/models/llama/llama.py
Outdated
# Further split long prompts into smaller chunks
all_prompts = []
for p in prompts:
    if len(p) > 300:  # Only split if text is longer than 300 chars
        chunks = self._split_text_into_chunks(p)
        all_prompts.extend(chunks)
    else:
        all_prompts.append(p)

prompts = [f"{voice}: " + p for p in all_prompts if p.strip()]
On second thought, I think we should revert most of these changes except this part, because the downstream code already handles a list of prompts.
@lucasnewman could you please check the sesame changes and see if anything stands out? I noticed that generate doesn't process a list of prompts like Kokoro (pipeline) and Orpheus do. Initially I thought of enforcing all models to use a
Looks fine to me apart from your comments.
Yeah, I personally prefer the simplest approach and lighter abstraction. I think it's reasonable to have every generate() implementation take either a string or list of strings though, since sentence splitting is so common / useful.
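The str-or-list convention suggested here can be handled with a small normalization step at the top of each generate() implementation. A sketch under that assumption (the helper name is illustrative):

```python
from typing import Union

def normalize_prompts(text: Union[str, list]) -> list[str]:
    """Let generate() accept either a single string or a list of
    strings by normalizing to a list up front."""
    if isinstance(text, str):
        return [text]
    return list(text)
```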
A /voices API was added to get the list of Kokoro voices and filter them by language for the frontend.
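The language filtering described here can key off the Kokoro voice-id convention, where the first letter encodes the language (e.g. `a` = American English, `b` = British English, `j` = Japanese) and the second the gender. A sketch of the filtering logic only, with a partial example voice list, not the server endpoint itself:

```python
# Partial example list of real Kokoro voice ids; the full list is larger.
KOKORO_VOICES = ["af_heart", "af_bella", "am_adam", "bf_emma", "bm_george", "jf_alpha"]

def voices_for_language(lang_prefix: str) -> list[str]:
    """Filter voices by their language prefix, e.g. "a" for American
    English, preserving list order."""
    return [v for v in KOKORO_VOICES if v.startswith(lang_prefix)]
```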
Closes #29 and #30