Releases: mideind/Tokenizer
Version 3.1.2
- Changed paragraph markers to be [[ and ]], i.e. without spaces, for better accuracy in character offset calculations.
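Why space-free markers help can be sketched in a few lines of plain Python. This is an illustrative model only, not the library's actual code: with fixed two-character markers and no surrounding spaces, each paragraph's position in the marked-up string is a simple constant adjustment away from its position in the sequence.

```python
# Illustrative sketch (not library code): space-free paragraph markers
# '[[' and ']]' make character-offset bookkeeping a fixed adjustment.

def mark_paragraphs(paragraphs):
    """Join paragraphs, wrapping each in the space-free markers."""
    return "".join("[[" + p + "]]" for p in paragraphs)

def offsets(paragraphs):
    """Yield the (start, end) offset of each paragraph inside the
    string produced by mark_paragraphs()."""
    pos = 0
    for p in paragraphs:
        pos += 2            # skip the opening '[[' marker
        yield (pos, pos + len(p))
        pos += len(p) + 2   # skip the paragraph text and closing ']]'

paras = ["Fyrsta málsgrein.", "Önnur málsgrein."]
marked = mark_paragraphs(paras)
# Each computed span recovers the paragraph exactly:
for (start, end), p in zip(offsets(paras), paras):
    assert marked[start:end] == p
```

With markers like "[[ " and " ]]" that include spaces, every offset would additionally shift by the number of inserted spaces, which is the inaccuracy the change removes.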
Version 3.1.1
- Minor fix: added Tok.from_token()
Version 3.1.0
- Added an -o switch to the tokenize command to return the original token text, enabling the tokenizer to run as a sentence splitter only.
Version 3.0.0
- Added tracking of character offsets for tokens within the original source text.
- Added full type annotations.
- Dropped Python 2.7 support. Tokenizer now supports Python >= 3.6.
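The idea behind offset tracking can be shown with a minimal sketch. This is hypothetical code, not the library's implementation: if each token carries a (start, end) span into the source text, the exact original text of any token or sentence can be recovered by slicing.

```python
# Hypothetical sketch (not the library's implementation) of tracking
# character offsets for tokens within the original source text.
import re

def tokenize_with_offsets(text):
    """Yield (token, start, end) triples such that
    text[start:end] == token."""
    for m in re.finditer(r"\S+", text):
        yield m.group(), m.start(), m.end()

src = "Halló  heimur!"
for tok, start, end in tokenize_with_offsets(src):
    # The span always maps back to the exact original slice,
    # even across irregular whitespace.
    assert src[start:end] == tok
```

Offsets of this kind are what make it possible to highlight or correct tokens in place in the original document, rather than only in the normalized token stream.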
Version 2.5.0
- Added command-line arguments to the tokenizer executable, corresponding to available tokenization options
- Updated and enhanced type annotations
- Minor documentation edits
Version 2.4.0
- Fixed a bug where certain well-known word forms (fá, fær, mín, sá...) were being wrongly interpreted as abbreviations.
- Also fixed a bug where certain abbreviations were being recognized even in uppercase and at the end of a sentence, for instance Örn.
Version 2.3.1
Various bug fixes; fixed type annotations for Python 2.7; the token kind NUMBER WITH LETTER is now NUMWLETTER.
Version 2.3.0
Added the replace_html_escapes option to the tokenize() function.
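What replacing HTML escapes means can be illustrated with the standard library's html module. This is only an analogy for the kind of conversion the option performs, assuming it resolves entities such as &amp; and &oacute; before tokenization; the option's exact behavior inside tokenize() may differ.

```python
# Illustration of HTML-escape replacement using the standard library;
# the tokenizer option applies a conversion of this kind internally.
from html import unescape

s = "J&oacute;n &amp; P&aacute;ll"
# Named entities are resolved to their characters:
assert unescape(s) == "Jón & Páll"
```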
Version 2.2.0
Fixed correct_spaces() to handle compounds such as "Atvinnu-, nýsköpunar- og ferðamálaráðuneytið" and "bensínstöðvar, -dælur og -tankar".
Version 2.1.0
- Changed the handling of periods at the end of sentences when they are part of an abbreviation. The period is now kept attached to the abbreviation rather than being split off into a separate period token, as before.