add a check_structure function to sinopy #5

Wu-Urbanek · 2020-04-20T08:31:44Z

A new function, "check_structure", could be implemented to check the tokenization of words in East Asian or Southeast Asian languages.

For example, the following issues can be detected:

Whether each syllable has at least an initial consonant, a vowel (and/or a tonal marker). For example, a form like ʔ/s-n-yum1 might be trimmed to ʔ (if the slash is used as a separator and the script chooses only the first form). This error will not be found by the regular tokenization check. But check_structure can complain that a vowel is missing, so that people can then spot the error.
Whether an empty syllable exists in the dataset. These are mostly caused by two consecutive morpheme boundaries, e.g. z a ² + + m i ² (should be z a ² + m i ²).
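
To make the two checks above more concrete, here is a rough sketch of what such a function might look like. It is not existing sinopy code: the name check_structure_sketch, the VOWELS and TONES inventories, and the require_tone flag are all assumptions made for illustration, and it assumes space-separated segments with "+" as the morpheme boundary. A real implementation would need a proper sound inventory and would have to handle things like syllabic nasals.

```python
# Hypothetical sketch only; names and inventories are assumptions, not sinopy's API.
VOWELS = set("aeiouyəɛɔɯɨʉøœɑɐɪʊ")        # illustrative, far from complete
TONES = set("¹²³⁴⁵⁰") | set("012345")     # superscript or plain tone digits


def check_structure_sketch(tokens, boundary="+", require_tone=False):
    """Return a list of (syllable_index, message) problems for one tokenized word."""
    # Split the segment string into syllables at each morpheme boundary.
    syllables, current = [], []
    for token in tokens.split():
        if token == boundary:
            syllables.append(current)
            current = []
        else:
            current.append(token)
    syllables.append(current)

    problems = []
    for i, segments in enumerate(syllables):
        if not segments:
            # Two boundaries in a row (or a leading/trailing one) leave nothing between them.
            problems.append((i, "empty syllable"))
            continue
        chars = [ch for seg in segments for ch in seg]
        if not any(ch in VOWELS for ch in chars):
            problems.append((i, "vowel missing"))
        if require_tone and not any(ch in TONES for ch in chars):
            problems.append((i, "tone missing"))
    return problems


print(check_structure_sketch("z a ² + + m i ²"))
print(check_structure_sketch("ʔ"))
```

With the examples from this issue, the doubled boundary in z a ² + + m i ² is reported as an empty syllable at index 1, and the trimmed form ʔ is reported as missing a vowel. Returning a list of problems rather than raising keeps it usable for scanning a whole dataset in one pass.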

A related issue is issue number 3 in the lexibank/lamayi repository.

Wu-Urbanek added the enhancement label Apr 20, 2020

Wu-Urbanek assigned LinguList Apr 20, 2020