create 33k Chinese samples #9

Open. Gpwner wants to merge 1 commit into main.

Gpwner commented Mar 13, 2025

Add Chinese samples.
Gpwner (Author) commented Mar 13, 2025

Hi, I have added a script that generates 33k good-quality Chinese samples.

marcus-daily (Contributor) commented

Thank you for the PR, and sorry for the slow response. If I understand correctly, this truncates each sentence at a random position, then uses the truncated sentence as the "incomplete" data and the original sentence as the "complete" data.
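For readers following along, the truncation described above can be sketched roughly as follows. This is a minimal illustration that works on text at the character level; the helper name `make_pair` is hypothetical, and the script in the PR may operate on audio or pick cut points differently.

```python
import random


def make_pair(sentence: str, rng: random.Random) -> tuple[str, str]:
    """Return (incomplete, complete) by cutting the sentence at a random position.

    Hypothetical helper for illustration only; not taken from the PR.
    """
    if len(sentence) < 2:
        raise ValueError("need at least two characters to truncate")
    # Keep at least one character and drop at least one character.
    cut = rng.randint(1, len(sentence) - 1)
    return sentence[:cut], sentence


rng = random.Random(0)  # fixed seed so the sketch is reproducible
incomplete, complete = make_pair("今天天气很好", rng)
```

The "incomplete" sample is always a strict prefix of the "complete" one, which is why it carries no filler words or pause cues of its own.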

In the current phase of the project we've trained the model to work in conjunction with VAD (voice activity detection), so the model will only run when there is a pause in the speech. In particular the model detects filler words like "um" and "er", and intonation common when someone is thinking of what to say.
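As a rough sketch of that gating, assuming the VAD produces one speech/no-speech flag per audio frame: the turn model is invoked only once a run of silence following speech reaches a threshold. The function name `run_on_pauses`, the frame-based threshold, and the callback are all assumptions for illustration. A real setup would get the flags from a VAD model and would call the turn model inside the callback; neither is invoked here.

```python
from typing import Callable, Iterable


def run_on_pauses(
    frames_are_speech: Iterable[bool],
    min_silence_frames: int,
    on_pause: Callable[[int], None],
) -> None:
    """Call on_pause(frame_index) once per pause that follows speech.

    Hypothetical sketch: on_pause stands in for running the turn model.
    """
    silence = 0
    heard_speech = False
    for i, is_speech in enumerate(frames_are_speech):
        if is_speech:
            heard_speech = True
            silence = 0
        else:
            silence += 1
            # Fire exactly once, when the silence run first reaches the threshold.
            if heard_speech and silence == min_silence_frames:
                on_pause(i)


fired = []
run_on_pauses(
    [True, True, False, False, False, True, False, False],
    min_silence_frames=2,
    on_pause=fired.append,
)
```

In this example the callback fires at the second silent frame of each pause, so the more expensive model runs twice rather than on every frame.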

Truncating mid-sentence produces a different kind of data, one that will probably be useful in a later phase of the project (particularly if we want to remove the need for the separate VAD step). Since it doesn't use filler words or pauses, though, it's not suitable for the current stage. I'll leave the PR open for now, as that might change.

The covost2 dataset might still be useful if we can run VAD on it to find samples containing a pause. If there are lots of samples with a pause, we can split the samples at the pause to obtain "incomplete" sentences in the desired format.
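One way the pause-based split might look, assuming a VAD has already produced sorted speech intervals for a sample: find the first gap between intervals that is long enough to count as a pause, and cut there. The function name `find_pause_cut`, the interval format in seconds, and the choice to cut at the end of the preceding speech are all assumptions, not something the project specifies.

```python
from typing import List, Optional, Tuple


def find_pause_cut(
    segments: List[Tuple[float, float]],
    min_pause: float,
) -> Optional[float]:
    """Return the end time of the speech preceding the first long pause.

    segments: sorted (start, end) speech intervals in seconds, as a VAD
    might report them. Returns None if no gap is at least min_pause long.
    Hypothetical helper for illustration only.
    """
    for (_, prev_end), (next_start, _) in zip(segments, segments[1:]):
        if next_start - prev_end >= min_pause:
            return prev_end
    return None


# Three speech intervals; only the second gap (0.8 s) is long enough.
cut = find_pause_cut([(0.0, 1.2), (1.3, 2.0), (2.8, 4.0)], min_pause=0.5)
no_cut = find_pause_cut([(0.0, 1.2), (1.3, 2.0)], min_pause=0.5)
```

Audio before `cut` would then serve as the "incomplete" sample, and samples where the result is `None` would be skipped.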

Gpwner (Author) commented Apr 2, 2025

@marcus-daily I think the project should support semantic VAD, like OpenAI does.

marcus-daily (Contributor) commented Apr 2, 2025

@Gpwner Agreed, this is a long-term goal of the project. Currently smart-turn is a lot more resource intensive than a VAD model like Silero, so if we want to combine the two, some optimisation work will be needed.

To clarify: smart-turn does already support semantic VAD, but this needs to be done in combination with a non-semantic VAD model like Silero (so that smart-turn knows when to run).
