False Positives with URLS #18
Comments
The two names vb.net and asp.net are indeed working URLs (though only one is registered by Microsoft). While they are probably used much more frequently as proper names, recognizing them as URLs is technically correct. In either case, they should not be split.

L/S/R and R/3 puzzled me at first. The explanation is that they are recognized as Reddit links. Reddit links take the form "/r/subreddit" or "/u/user". The leading slash is often omitted, and the German Reddit community also uses "l" instead of "r".

If the token's class (URL vs. abbreviation) is important for your use case, you could either try to correct this in a postprocessing step or, in the case of Reddit links, try to get rid of them by overwriting the corresponding regex:

    import re
    from somajo import SoMaJo

    tokenizer = SoMaJo("de_CMC")
    # Replace the Reddit link regex with a pattern that is very unlikely to match.
    tokenizer._tokenizer.reddit_links = re.compile(r"\s{10}")

When the regex for [...]

Of course, a cleaner solution would be to either have an option for enabling/disabling the recognition of Reddit links or, even better, to have an option for user-specified special cases that are processed relatively early.
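The postprocessing route mentioned above could look roughly like the following. This is only a sketch: `Token` is a hypothetical stand-in with just a `text` and a `token_class` field, the class labels `"URL"` and `"abbreviation"` are assumed, and the set of known terms is an example rather than anything shipped with the tokenizer.

```python
from dataclasses import dataclass


# Hypothetical stand-in for a tokenizer token; only the two fields used here.
@dataclass
class Token:
    text: str
    token_class: str


# Example terms that should count as abbreviations even when tagged as URLs.
KNOWN_ABBREVIATIONS = {"l/s/r", "r/3"}


def relabel(tokens, known=KNOWN_ABBREVIATIONS):
    """Return a new list in which known terms tagged as URL become abbreviations."""
    return [
        Token(t.text, "abbreviation")
        if t.token_class == "URL" and t.text.lower() in known
        else t
        for t in tokens
    ]


tokens = [Token("SAP", "regular"), Token("R/3", "URL"), Token("r/de", "URL")]
fixed = relabel(tokens)
```

Only the class label is changed; the token text and the token boundaries stay as the tokenizer produced them, so a real Reddit link such as "r/de" keeps its URL class.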
You are obviously right concerning the first two. You might consider changing the regex so it no longer hits on 'r/l' or 'l/r' literally, because in a technical context this often means "rechts/links" or "links/rechts" (right/left or left/right). But I don't know how this would be handled in a competitive scenario. I'm already doing a lot of preprocessing by replacing substrings that I don't want to split and reintroducing them afterwards, pretty much like you did in the pre-2.0 versions.
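A minimal sketch of that kind of replace-and-reintroduce preprocessing, for illustration only. The term list, the placeholder format, and `naive_tokenize` (a stand-in for the real tokenizer) are all assumptions; the placeholder must be a string that never occurs in the input and that the tokenizer keeps as a single token.

```python
import re

# Example terms that must survive tokenization unsplit.
PROTECTED = ["L/S/R", "R/3", "l/r", "r/l"]


def protect(text, terms=PROTECTED):
    """Replace each protected term with a placeholder; return text and mapping."""
    mapping = {}
    # Longest terms first so that "L/S/R" wins over shorter overlapping terms.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(t) for t in ordered) + r")(?!\w)"
    )

    def substitute(match):
        key = f"PROTECTED{len(mapping)}X"
        mapping[key] = match.group(0)
        return key

    return pattern.sub(substitute, text), mapping


def restore(tokens, mapping):
    """Swap placeholders back to the original terms after tokenization."""
    return [mapping.get(tok, tok) for tok in tokens]


def naive_tokenize(text):
    """Stand-in tokenizer that splits on whitespace and around punctuation."""
    return re.findall(r"\w+|[^\w\s]", text)


masked, mapping = protect("Schalter l/r und SAP R/3")
tokens = restore(naive_tokenize(masked), mapping)
```

The drawback of this approach is that the tokenizer never sees the original term, so whatever class it assigns to the placeholder has to be corrected separately if the class matters.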
I just wanted to make you aware that frameworks such as 'VB.NET' or 'ASP.Net' are considered URLs after tokenization and are thus not split (which is probably good). This is also the case for some abbreviations such as 'L/S/R' and SAP versions such as R/3. Unfortunately, this can't be prevented by adding them to 'single_token_abbreviations_de.txt', since abbreviations are checked after URLs. (R/3 is even included in 'single_token_abbreviations_de.txt'.)
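To see why strings like 'L/S/R' and 'R/3' can look like Reddit links, here is a small illustration. The pattern below is a hypothetical approximation written for this example (optional leading slash, one of "r", "l" or "u", then one or more "/name" parts, case-insensitive); it is not taken from the tokenizer's source.

```python
import re

# Hypothetical approximation of a Reddit-style link pattern, for illustration only.
REDDIT_LIKE = re.compile(r"(?<!\w)/?[rlu](?:/\w+)+/?(?!\w)", re.IGNORECASE)


def looks_like_reddit_link(candidate):
    """Return True if the whole candidate matches the Reddit-style pattern."""
    return REDDIT_LIKE.fullmatch(candidate) is not None


# Candidates taken from this thread, plus two that should not match.
candidates = ["/r/subreddit", "r/de", "L/S/R", "R/3", "l/r", "A/B", "rechts/links"]
results = {c: looks_like_reddit_link(c) for c in candidates}
```

Under this approximation, the real Reddit links and the abbreviations 'L/S/R', 'R/3' and 'l/r' all match, while 'A/B' and 'rechts/links' do not, which is the ambiguity described in this thread.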