neurlang/dataset


IPA Phonetic dataset lexicon

136+ languages

Languages written without spaces between words (their datasets must therefore contain sentences as well as words):

  • Chinese
  • Japanese
  • Tibetan
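
To illustrate why these languages need sentence-level entries, here is a minimal sketch: splitting on whitespace recovers words from an English sentence but returns a Chinese sentence as a single unsegmented token, so a word-only lexicon cannot be applied directly. The example sentences are illustrative and are not taken from the dataset.

```python
# Whitespace tokenization works for space-delimited languages...
english = "the quick brown fox"
english_tokens = english.split()

# ...but a Chinese sentence has no spaces, so split() cannot find word boundaries.
chinese = "我喜欢学习语言"
chinese_tokens = chinese.split()

print(len(english_tokens))  # number of English words found
print(len(chinese_tokens))  # the whole sentence comes back as one token
```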

Licensing information:

  • MIT: Original data for 15 languages, taken from the gruut databases
  • MIT: Data for 31 languages, added from ipa dict files
  • CC0: Public Domain: Chinese/Mandarin-IPA sentence pairs, generated:
    • from Chinese sentences taken from a Kaggle dataset
    • using the dictionary above together with an IPA dictionary produced with Mistral-Nemo; the Chinese sentences were converted to IPA sentences by string substitution.
  • Mozilla Public License 2.0: Chinese/Mandarin extra phrases that were missing, added from Mozilla Common Voice 19.
  • Apache-2.0: WikiPron data, added for selected large languages
  • cc-by-nc-4.0: Tibetan data taken from billingsmoore
  • cc-by-nc-4.0: Additional Tibetan data from billingsmoore
  • MIT/Apache2: The Slovak data is mostly self-made; I hereby release it under MIT/Apache2
  • CC-BY-SA 3.0: Japanese corpus. The text data were modified and pronunciation information was added. The basic5000 text is licensed as follows:
    • Wikipedia: CC-BY-SA 3.0
    • TANAKA corpus (Tanaka_Corpus): CC-BY 2.0
    • JSUT (Japanese speech corpus of Saruwatari-lab., University of Tokyo): CC-BY-SA 4.0
  • Mozilla Public License 2.0: Japanese city names, added from Mozilla Common Voice 19.
  • Apache-2.0: US/UK English data sourced from Kokoro Misaki
  • Unknown license: Cantonese words and sentences sourced from GitHub
  • Apache-2.0: English homograph data (multi.tsv), sourced mainly from the Homograph disambiguation data
  • cc-by-nc-sa-4.0: Hokkien (Taiwanese Minnan) data from sarahwei
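
The string-substitution step mentioned for the Chinese/Mandarin sentence pairs can be sketched roughly as follows. This is not the repository's actual script: the function name `sentence_to_ipa`, the greedy longest-match strategy, and the toy lexicon (rough, toneless transcriptions) are all assumptions made for illustration.

```python
def sentence_to_ipa(sentence: str, lexicon: dict[str, str]) -> str:
    """Replace dictionary entries in `sentence` with their IPA, longest match first.

    Hypothetical helper: characters with no lexicon entry are passed through unchanged.
    """
    max_len = max(map(len, lexicon), default=1)
    pieces = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate substring first, then shrink it.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            chunk = sentence[i:i + size]
            if chunk in lexicon:
                pieces.append(lexicon[chunk])
                i += size
                break
        else:
            # No entry covers this character; keep it as-is.
            pieces.append(sentence[i])
            i += 1
    return " ".join(pieces)


# Toy lexicon for illustration only (not entries from the dataset).
toy_lexicon = {
    "你": "ni",
    "好": "xau",
    "你好": "ni xau",
    "世界": "ʂɨ tɕjɛ",
}

print(sentence_to_ipa("你好世界", toy_lexicon))
```

Matching the longest entry first means multi-character words in the dictionary take priority over their individual characters, which matters when a word's pronunciation differs from the concatenation of its characters.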