Code-switched speech has different voices #10


Open
c9412600 opened this issue Dec 23, 2020 · 10 comments


c9412600 commented Dec 23, 2020

I used your model. My experiment used the open-source Biaobei and LJSpeech datasets. After 22,000 training steps it successfully synthesized mixed Chinese and English speech, but the Chinese audio sounds like the Biaobei voice and the English audio sounds like the LJSpeech voice.
Is the number of training steps insufficient?
Thanks

@Jeevesh8
Member

@c9412600 You mean that even after changing the speaker_no you input to the model, the voice remains unchanged for the same sentence? Could you attach the audio, if possible?

Author

c9412600 commented Dec 23, 2020

I set speaker_no = 0 and lang = 0, with the following lines:
lj*.mel.npy/THE PRESIDENT ALMOST COMPLETELY BLOCKED OSWALD'S VIEW OF THE GOVERNOR #3 PRIOR TO THE TIME THE FIRST SHOT STRUCK THE PRESIDENT.|1|1
biaobei*.mel.npy/ta1 ti2 chu1 zhen3 duan4 dan1 biao3 shi4 zi4 ji3 de5 zuo3 xi1 you4 shou3 zhou3 dou1 cuo4 shang1.|0|0
and the synthesized wav file is attached:
test.zip

The speaker changes in the middle of the audio. I would like to ask whether the number of training steps is insufficient or a parameter setting is wrong.
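
For reference, each line above appears to follow the pattern `mel_path/transcript|speaker_id|language_id`. A minimal sketch of parsing it (the helper below is illustrative only, not the repo's actual data loader):

```python
# Illustrative parser for lines of the form
#   <mel_path>/<transcript>|<speaker_id>|<language_id>
# (layout inferred from the lines above; hypothetical, not the repo's loader)

def parse_metadata_line(line: str):
    # Split off the two trailing integer fields; the rest is path + transcript.
    path_and_text, speaker_id, language_id = line.rsplit("|", 2)
    return path_and_text, int(speaker_id), int(language_id)

example = "biaobei*.mel.npy/ta1 ti2 chu1 zhen3 duan4 dan1 biao3 shi4 zi4 ji3 de5 zuo3 xi1 you4 shou3 zhou3 dou1 cuo4 shang1.|0|0"
text, speaker, lang = parse_metadata_line(example)
print(speaker, lang)  # -> 0 0 (Biaobei speaker, Chinese)
```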

@Jeevesh8
Member

@c9412600 It is quite interesting how the voice changes partway through. If you loaded pre-trained Tacotron2 weights, as in the repo, you can try training up to 40-60k steps; I don't think there will be much improvement after that.

If you haven't loaded T2 weights, then you'll need more steps, around as many as mentioned in the paper.

@c9412600
Author

@Jeevesh8 I haven't loaded pre-trained T2 weights. I will continue training for a longer time and report the results later. Thank you for your help!

@Jeevesh8
Member

Thank you for the feedback, @c9412600 :)

If you want to load pre-trained weights in the future, you can just provide the T2 checkpoint via the --checkpoint_path argument.
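
For example, the command might look like this (a sketch only; `train.py` and the checkpoint filename are placeholders, and only the `--checkpoint_path` flag comes from the comment above):

```
python train.py --checkpoint_path tacotron2_statedict.pt
```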

@c9412600
Author

Got it! I will keep trying.

@c9412600
Author

@Jeevesh8 There is one more thing I forgot to ask you about. Did this phenomenon, different people's voices within the same sentence, occur during your training? If not, how did you set up your dataset?

@Jeevesh8
Member

@c9412600 No, this phenomenon certainly didn't occur during my training, and I don't set up my dataset in any special way. How frequently did you observe it? In every audio you generated, or only in a very few?

Author

c9412600 commented Jan 4, 2021

@Jeevesh8 Most of them have this phenomenon; maybe it is because my dataset has only one Chinese speaker and one English speaker. What is the composition of your dataset? How many speakers? How many languages?
Thanks

@mudong0419

@Jeevesh8 Has this problem been solved? The ST-CMDS dataset has more speakers; have you tried it?
