add a check_structure function to sinopy #5

Open
Wu-Urbanek opened this issue Apr 20, 2020 · 0 comments

A new function, "check_structure", could be implemented to check the tokenization of words in East Asian and Southeast Asian languages.

For example, the following issues can be detected:

  1. Whether each syllable has at least an initial consonant and a vowel (and/or a tone marker). For example, a form like ʔ/s-n-yum1 might be trimmed to ʔ (if the slash is used as a separator and the script keeps only the first form). A regular tokenization check will not catch this error, but check_structure can report the missing vowel, so that people can then spot the underlying error.

  2. Whether the dataset contains an empty syllable. These are mostly caused by two adjacent morpheme boundaries, e.g. z a ² + + m i ² (should be z a ² + m i ²).
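
The two checks above could be sketched roughly as follows. This is only an illustration of the idea, not an existing sinopy function: the signature, the `boundary` parameter, the return format, and the small `VOWELS` inventory are all assumptions, and the input is assumed to be a space-separated segment string with "+" marking morpheme boundaries.

```python
# Assumption: a deliberately small vowel inventory, for illustration only.
# A real implementation would need a proper sound-class lookup.
VOWELS = set("aeiouyɐɑɒɔəɛɜɤɨɯɪʉʊʌæøœ")


def check_structure(tokens, boundary="+"):
    """Return a list of (syllable_index, message) pairs for one tokenized form.

    Hypothetical sketch: `tokens` is a space-separated segment string and
    `boundary` is the segment that separates syllables/morphemes.
    """
    # Split the form into syllables at every boundary marker.
    syllables, current = [], []
    for segment in tokens.split():
        if segment == boundary:
            syllables.append(current)
            current = []
        else:
            current.append(segment)
    syllables.append(current)

    problems = []
    for index, syllable in enumerate(syllables):
        if not syllable:
            # Two adjacent boundaries (or a leading/trailing one) leave nothing here.
            problems.append((index, "empty syllable"))
        elif not any(char in VOWELS for segment in syllable for char in segment):
            problems.append((index, "missing vowel"))
    return problems


print(check_structure("z a ² + + m i ²"))  # [(1, 'empty syllable')]
print(check_structure("ʔ"))                # [(0, 'missing vowel')]
print(check_structure("z a ² + m i ²"))    # []
```

Note that a syllabic nasal would also be reported as "missing vowel" under this simple rule, so the output is best treated as a list of forms to inspect by hand rather than as definite errors.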

A related issue is issue number 3 in the lexibank/lamayi repository.
