A new function "check_structure" could be implemented to check the tokenization of words in East Asian and Southeast Asian languages.
For example, it could detect the following issues:
Whether each syllable has at least an initial consonant and a vowel (and/or a tone marker). For example, a form like ʔ/s-n-yum1 might be trimmed to ʔ (if the slash is used as a separator and the script chooses only the first form). The regular tokenization check will not catch this error, but check_structure could report a missing vowel, so that people can then spot the error.
Whether the dataset contains an empty syllable. These are mostly caused by two consecutive morpheme boundaries, e.g. z a ² + + m i ² (should be z a ² + m i ²).
A related issue is issue number 3 in the lexibank/lamayi repository.
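A minimal sketch of what such a check might look like, only to make the two cases above concrete. Everything here is an assumption rather than an existing API: the name `check_structure` is taken from the proposal, but its signature, the helper names, the tiny `VOWELS` set, and the simplification that each chunk between `+` boundaries counts as one syllable are all hypothetical. It also only tests for a vowel, not for an initial consonant or a tone marker.

```python
import unicodedata

# Hypothetical, deliberately small vowel inventory (assumption). A real
# implementation would more likely rely on a sound-class system.
VOWELS = set("aeiouyɨʉɯəɛɔæɑɒøœɤʌɐɪʊ")


def is_vowel(segment):
    """Return True if the segment contains a vowel letter from VOWELS."""
    # NFD splits precomposed letters (e.g. nasalized vowels) into base
    # letter plus combining marks, so the base letter can be matched.
    decomposed = unicodedata.normalize("NFD", segment)
    return any(ch in VOWELS for ch in decomposed)


def split_morphemes(segments):
    """Split a list of segments into chunks at each '+' boundary."""
    chunks, current = [], []
    for seg in segments:
        if seg == "+":
            chunks.append(current)
            current = []
        else:
            current.append(seg)
    chunks.append(current)
    return chunks


def check_structure(tokens):
    """Return a list of (chunk_index, message) pairs for suspicious chunks.

    `tokens` is a space-separated string of segments, with '+' marking
    morpheme boundaries. Each chunk is treated as one syllable (assumption).
    """
    problems = []
    for index, chunk in enumerate(split_morphemes(tokens.split())):
        if not chunk:
            problems.append((index, "empty syllable"))
        elif not any(is_vowel(seg) for seg in chunk):
            problems.append((index, "vowel missing"))
    return problems
```

With these assumptions, `check_structure("z a ² + + m i ²")` returns `[(1, "empty syllable")]`, `check_structure("ʔ")` returns `[(0, "vowel missing")]`, and the corrected form `z a ² + m i ²` returns an empty list.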