Add Vietnamese support using pyvi #113


Open. kurtisc wants to merge 1 commit into base: master

kurtisc commented Apr 21, 2020

Hi!

Unlike most other languages written in the Latin alphabet, Vietnamese does not use spaces to separate words[1]. Spaces fall between syllables, and a single word can span several of them, so the current space-based morphemizer is unsuitable.

[1] Fun read https://www.tandfonline.com/doi/pdf/10.1080/00437956.1963.11659787
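
To illustrate the problem, here is a minimal sketch comparing a plain split on spaces with a word-level segmentation. The names `split_on_spaces`, `segment_with_lexicon` and `TOY_LEXICON` are hypothetical and not part of MorphMan or pyvi, and the greedy longest-match lookup against a one-entry lexicon is only a stand-in for a real segmenter, not how pyvi works.

```python
def split_on_spaces(text):
    """What a space-based morphemizer sees: one token per syllable."""
    return text.split()


def segment_with_lexicon(text, lexicon, max_syllables=4):
    """Toy segmenter: greedily join syllables that form a known word.

    `lexicon` is a set of lowercase multi-syllable words written with
    single spaces. This is an illustration only.
    """
    syllables = text.split()
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first, then fall back to one syllable.
        longest = min(max_syllables, len(syllables) - i)
        for n in range(longest, 0, -1):
            candidate = " ".join(syllables[i:i + n])
            if n == 1 or candidate.lower() in lexicon:
                words.append(candidate)
                i += n
                break
    return words


# Hypothetical one-entry lexicon: "sinh viên" (student) is a single word.
TOY_LEXICON = {"sinh viên"}
sentence = "Tôi là sinh viên"

print(split_on_spaces(sentence))
print(segment_with_lexicon(sentence, TOY_LEXICON))
```

Splitting on spaces yields four tokens and breaks "sinh viên" in two, while the word-level segmentation keeps it as one unit, which is the unit a morphemizer needs.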

I wasn't able to find a small library that does word segmentation for Vietnamese the way Jieba does for Chinese. Bundling pyvi in-code, as has been done for Jieba, would mean also bundling several much larger dependencies (e.g. NumPy).

So, if merged as is, the burden of getting Vietnamese support working unfortunately falls on the end user. On the other hand, users who don't want it won't see it, and it won't affect their usage.
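
The "won't appear if not installed" behaviour can be sketched with an optional import. This is an assumption about the general pattern, not the code in this pull request: `vietnamese_available`, `segment_vietnamese` and `available_morphemizers` are hypothetical names. The only pyvi call used is `ViTokenizer.tokenize`, which returns a string in which the syllables of a multi-syllable word are joined with underscores.

```python
# Optional dependency: fall back to None when pyvi is not installed.
try:
    from pyvi import ViTokenizer
except ImportError:
    ViTokenizer = None


def vietnamese_available():
    """True only when pyvi could be imported."""
    return ViTokenizer is not None


def segment_vietnamese(text):
    """Return a list of words, using pyvi when it is available."""
    if ViTokenizer is None:
        raise RuntimeError("pyvi is not installed; Vietnamese support is disabled")
    # pyvi joins the syllables of one word with "_"; restore the spaces.
    # Assumption: the input itself contains no literal underscores.
    return [token.replace("_", " ") for token in ViTokenizer.tokenize(text).split()]


def available_morphemizers():
    """Hypothetical registry: list Vietnamese only when it can actually run."""
    names = ["spaces"]
    if vietnamese_available():
        names.append("vietnamese")
    return names


print(available_morphemizers())
```

With this shape, users without pyvi never see a Vietnamese option, and nothing else in the add-on has to import pyvi or its dependencies.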

If this gets included, I'll look into packaging pyvi and its dependencies as a separate add-on, as has been done for MeCab, licences permitting. That would make installation more straightforward and avoid forcing users onto the source version of Anki.

kurtisc (Author) commented Aug 15, 2020

Rebased on master and confirmed working when #125 is merged.

With regard to #145: I do have a test for this morphemizer, so hopefully that addresses @shanrauf's comment.

ianki (Collaborator) commented Nov 9, 2020

Would you mind rebasing again, so I can see if the tests pass? I'll submit after.

smartlitchi commented

I am really interested in this

sedosido commented

I haven't been able to build Anki from source to import pyvi (I think because my hardware is a little old). Is there any other way I can get Vietnamese parsing to work with MorphMan?

4 participants