Releases: mideind/Tokenizer

Version 3.1.2

02 Jun 17:04
  • Changed paragraph markers to be [[ and ]], i.e. without spaces, for better accuracy in character offset calculations.
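A minimal sketch of why this matters (illustrative only, not the library's code): the markers `[[` and `]]` are now exactly two characters wide with no surrounding spaces, so character offsets into the marked-up text can be computed with fixed arithmetic.

```python
# Illustrative sketch: whitespace-free paragraph markers have a fixed
# two-character width, keeping offset arithmetic exact.
MARK_BEGIN, MARK_END = "[[", "]]"

marked = "[[Fyrsta málsgrein.]][[Önnur málsgrein.]]"
# The first paragraph's text starts immediately after the 2-char marker:
start = len(MARK_BEGIN)
end = marked.index(MARK_END)
assert marked[start:end] == "Fyrsta málsgrein."
```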

Version 3.1.1

10 May 14:38
  • Minor fix: added Tok.from_token().

Version 3.1.0

29 Apr 10:24
  • Added -o switch to tokenize command to return original token text, enabling the tokenizer to run as a sentence splitter only.

Version 3.0.0

09 Apr 16:01
0e881d7
  • Added tracking of character offsets for tokens within the original source text.
  • Added full type annotations.
  • Dropped Python 2.7 support. Tokenizer now supports Python >= 3.6.
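What offset tracking makes possible can be sketched as follows (an illustrative stand-alone function, not the library's implementation): each token carries a (start, end) span that maps it back to the original source text.

```python
import re

def tokens_with_offsets(text):
    # Yield (token, start, end) triples; start and end index into the
    # original text, so every token can be mapped back to its source span.
    for m in re.finditer(r"\S+", text):
        yield m.group(), m.start(), m.end()

source = "Halló  heimur!"
for tok, start, end in tokens_with_offsets(source):
    # Each span reproduces the token verbatim from the source,
    # even across irregular whitespace:
    assert source[start:end] == tok
```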

Version 2.5.0

08 Mar 11:45
bed46a2
  • Added command-line arguments to the tokenizer executable, corresponding to the available tokenization options.
  • Updated and enhanced type annotations.
  • Minor documentation edits.

Version 2.4.0

08 Oct 12:02
  • Fixed a bug where certain well-known word forms (fær, mín, ...) were being misinterpreted as abbreviations.
  • Fixed a bug where certain abbreviations, for instance Örn, were recognized even in uppercase and at the end of a sentence.

Version 2.3.1

21 Sep 12:03
  • Various bug fixes.
  • Fixed type annotations for Python 2.7.
  • Renamed the token kind NUMBER WITH LETTER to NUMWLETTER.

Version 2.3.0

03 Sep 17:49
  • Added the replace_html_escapes option to the tokenize() function.

Version 2.2.0

20 Aug 22:16
  • Fixed correct_spaces() to handle compounds such as Atvinnu-, nýsköpunar- og ferðamálaráðuneytið and bensínstöðvar, -dælur og -tankar.
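The compound case can be illustrated with a much-simplified spacing sketch (a hypothetical helper; the real correct_spaces() covers many more cases than this):

```python
def join_tokens(tokens):
    # Simplified spacing rule: no space before closing punctuation,
    # a single space between all other tokens. Hyphen-final compound
    # prefixes ("Atvinnu-") and hyphen-initial suffixes ("-dælur")
    # need no special casing here: the hyphen stays inside its token.
    no_space_before = {",", ".", ";", ":", "!", "?"}
    out = []
    for tok in tokens:
        if out and tok not in no_space_before:
            out.append(" ")
        out.append(tok)
    return "".join(out)

print(join_tokens(["Atvinnu-", ",", "nýsköpunar-", "og", "ferðamálaráðuneytið"]))
# Atvinnu-, nýsköpunar- og ferðamálaráðuneytið
```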

Version 2.1.0

02 Jul 16:05
69a4443
  • Changed the handling of periods at the end of a sentence when they are part of an abbreviation: the period is now kept attached to the abbreviation instead of being split off into a separate period token, as before.
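The new behaviour can be sketched as follows (a hypothetical helper and a tiny illustrative abbreviation set, not the library's API):

```python
# A sentence-final period that is part of a known abbreviation stays
# attached to it; an ordinary final period becomes a separate token.
ABBREVIATIONS = {"o.fl.", "o.s.frv.", "t.d."}  # tiny illustrative set

def split_final_period(word):
    if word in ABBREVIATIONS:
        return [word]              # period kept attached, as of 2.1.0
    if word.endswith("."):
        return [word[:-1], "."]    # split off the sentence-final period
    return [word]

assert split_final_period("o.fl.") == ["o.fl."]
assert split_final_period("heimur.") == ["heimur", "."]
```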