create 33k Chinese samples #9

Gpwner · 2025-03-13T03:48:50Z

add chinese samples

Gpwner · 2025-03-13T03:50:13Z

Hi, I have added a script that generates 33k good-quality Chinese data samples.

marcus-daily · 2025-04-02T08:59:50Z

Thank you for the PR, and sorry for the slow response. If I understand correctly, this truncates sentences at a random position, then uses the truncated sentence as the "incomplete" data and the original sentence as the "complete" data.
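For readers following along, the truncation approach described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the function name `make_truncated_pair`, the `min_keep`/`max_keep` bounds, and the choice to work on a plain list of audio samples are all assumptions made for the example.

```python
import random
from typing import List, Sequence, Tuple


def make_truncated_pair(
    samples: Sequence[float],
    rng: random.Random,
    min_keep: float = 0.3,  # assumed lower bound on the kept fraction
    max_keep: float = 0.8,  # assumed upper bound on the kept fraction
) -> Tuple[List[float], List[float]]:
    """Return (incomplete, complete) built from one utterance.

    The "incomplete" item is the utterance cut at a random position;
    the "complete" item is the untouched original.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to truncate")
    keep_fraction = rng.uniform(min_keep, max_keep)
    # Clamp so the incomplete item is non-empty and strictly shorter.
    cut = max(1, min(len(samples) - 1, int(len(samples) * keep_fraction)))
    return list(samples[:cut]), list(samples)


if __name__ == "__main__":
    utterance = [0.1] * 16000  # one second of placeholder audio at 16 kHz
    incomplete, complete = make_truncated_pair(utterance, random.Random(0))
    print(len(incomplete), len(complete))
```

Note that the cut point here is unrelated to any pause or filler word in the speech, which is the property discussed in the rest of this comment.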

In the current phase of the project we've trained the model to work in conjunction with VAD (voice activity detection), so the model will only run when there is a pause in the speech. In particular the model detects filler words like "um" and "er", and intonation common when someone is thinking of what to say.

Truncating mid-sentence produces a different kind of data, which will probably be useful in a later phase of the project (particularly if we want to remove the need for the separate VAD step). But since it doesn't contain filler words or pauses, it's not suitable for the current stage. I'll leave the PR open for now, as that might change.

The covost2 dataset might still be useful if we can run VAD on it to find samples containing a pause. If there are lots of samples with a pause, we can split the samples at the pause to obtain "incomplete" sentences in the desired format.
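As a rough sketch of the pause-splitting idea, assuming mono audio as a list of floats in [-1, 1]: the example below uses a simple frame-energy threshold as a stand-in for a real VAD model, so it is not how Silero or any other VAD works. The function names, the thresholds, and the decision to cut at the start of the pause are all hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def find_internal_pause(
    samples: Sequence[float],
    sample_rate: int,
    frame_ms: int = 20,
    min_pause_ms: int = 200,         # assumed minimum pause length
    energy_threshold: float = 1e-4,  # assumed silence threshold
) -> Optional[Tuple[int, int]]:
    """Return (start, end) sample indices of the first pause that has
    speech on both sides, or None if there is no such pause."""
    frame_len = max(1, sample_rate * frame_ms // 1000)
    frames_needed = -(-min_pause_ms // frame_ms)  # ceiling division
    seen_speech = False
    run_start = 0   # sample index where the current silent run began
    run_frames = 0  # length of the current silent run, in frames
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        if energy < energy_threshold:
            if seen_speech:  # ignore silence before the first speech
                if run_frames == 0:
                    run_start = start
                run_frames += 1
        else:
            if run_frames >= frames_needed:
                return run_start, start
            seen_speech = True
            run_frames = 0
    return None


def split_at_pause(
    samples: Sequence[float], sample_rate: int
) -> Optional[Tuple[List[float], List[float]]]:
    """Return (incomplete, complete), or None if no pause was found."""
    pause = find_internal_pause(samples, sample_rate)
    if pause is None:
        return None
    pause_start, _ = pause
    return list(samples[:pause_start]), list(samples)
```

Samples for which `split_at_pause` returns `None` would simply be skipped. Whether the "incomplete" clip should keep some of the trailing silence is a separate choice that this sketch does not settle.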

Gpwner · 2025-04-02T09:41:12Z

@marcus-daily I think the project should support semantic VAD, like OpenAI's.

marcus-daily · 2025-04-02T10:14:23Z

@Gpwner Agreed, this is a long-term goal of the project. Currently smart-turn is a lot more resource-intensive than a VAD model like Silero, so if we want to combine the two, some optimisation work will be needed.

To clarify: smart-turn does already support semantic VAD, but this needs to be done in combination with a non-semantic VAD model like Silero (so that smart-turn knows when to run).
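To illustrate the combination, here is a minimal sketch of the gating pattern, where a cheap per-frame check decides when the more expensive end-of-turn model gets called. The two callables are placeholders, not the real Silero or smart-turn APIs, and `run_gated` and `silence_frames_needed` are hypothetical names.

```python
from typing import Callable, Iterable, List, Sequence

Frame = Sequence[float]


def run_gated(
    frames: Iterable[Frame],
    is_speech: Callable[[Frame], bool],               # stand-in for a cheap VAD
    is_turn_complete: Callable[[List[Frame]], bool],  # stand-in for the turn model
    silence_frames_needed: int = 3,                   # assumed pause length in frames
) -> List[int]:
    """Return the frame indices at which a turn was judged complete."""
    buffered: List[Frame] = []
    silent_run = 0
    end_points: List[int] = []
    for index, frame in enumerate(frames):
        buffered.append(frame)
        if is_speech(frame):
            silent_run = 0
            continue
        silent_run += 1
        # Call the expensive model once per pause, at the moment the pause
        # becomes long enough, and only if the buffer holds more than silence.
        if silent_run == silence_frames_needed and len(buffered) > silent_run:
            if is_turn_complete(buffered):
                end_points.append(index)
                buffered = []
                silent_run = 0
    return end_points
```

With this structure the turn model never runs while speech is ongoing, so its cost is paid once per pause instead of once per frame.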

Create generate_zh_CN_data.py

4cbb7c5

add chinese samples
