136+ languages
(thus their dataset must contain sentences too):
- Chinese
- Japanese
- Tibetan
- MIT: Original data for 15 languages, taken from the gruut databases.
- MIT: Data for 31 more languages, added from the ipa dict files.
- CC0 (Public Domain): Chinese/Mandarin-IPA sentence pairs, generated as follows:
  - The Chinese sentences were taken from a Kaggle dataset.
  - Using the dictionary above together with an IPA dictionary produced with Mistral-Nemo, the Chinese sentences were converted to IPA sentences by string substitution.
- Mozilla Public License 2.0: Extra Chinese/Mandarin phrases that were missing, added from Mozilla Common Voice 19.
- Apache-2.0: WikiPron data, added for selected large languages.
- cc-by-nc-4.0: Tibetan data taken from billingsmoore.
- cc-by-nc-4.0: Additional Tibetan data from billingsmoore.
- MIT/Apache2: The Slovak data is mostly self-made; I hereby release it under MIT/Apache2.
- CC-BY-SA 3.0: The text in the Japanese corpus is licensed as follows. The text data was modified and pronunciation information was added. basic5000 is sourced as follows:
  - Wikipedia: CC-BY-SA 3.0
  - Tanaka Corpus: CC-BY 2.0
  - JSUT (Japanese speech corpus of Saruwatari-lab., University of Tokyo): CC-BY-SA 4.0
- Mozilla Public License 2.0: Japanese city names, added from Mozilla Common Voice 19.
- Apache-2.0: US/UK English data sourced from Kokoro Misaki.
- Unknown license: Cantonese words and sentences sourced from GitHub.
- Apache-2.0: English homograph data (multi.tsv), sourced mainly from the Homograph disambiguation data.
- cc-by-nc-sa-4.0: Hokkien (Taiwanese Minnan) data from sarahwei.
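
The string-substitution step mentioned for the Chinese/Mandarin-IPA sentence pairs can be pictured roughly as follows. This is only a minimal sketch, not the script actually used to build the dataset: the function name `sentence_to_ipa`, the greedy longest-match strategy, and the toy dictionary entries (simplified, tone-less placeholder transcriptions) are all assumptions made for illustration.

```python
def sentence_to_ipa(sentence, ipa_dict, sep=" "):
    """Replace dictionary entries found in `sentence` with their IPA strings.

    Hypothetical helper: scans left to right and always tries the longest
    dictionary key first, since Chinese text has no spaces between words.
    Characters not covered by the dictionary are passed through unchanged.
    """
    max_len = max((len(word) for word in ipa_dict), default=1)
    pieces = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            chunk = sentence[i:i + size]
            if chunk in ipa_dict:
                pieces.append(ipa_dict[chunk])
                i += size
                break
        else:
            # No dictionary entry starts here: keep the character as-is.
            pieces.append(sentence[i])
            i += 1
    return sep.join(pieces)


# Toy dictionary with placeholder transcriptions (not taken from the dataset).
toy_dict = {"你": "ni", "好": "xau", "你好": "ni xau", "世界": "ʂɻ tɕjɛ"}
print(sentence_to_ipa("你好世界", toy_dict))
```

Trying the longest key first means a multi-character word such as 你好 is matched as a whole before its single characters are considered, which matters when the whole-word pronunciation differs from the character-by-character one.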