Releases: mideind/Tokenizer
Version 2.0.7
- Added TOK.COMPANY token type
- Fixed a few abbreviations
- Renamed parameter text to text_or_gen in functions that accept a string or a string iterator
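A parameter that accepts either a string or a string iterator is typically normalized into a single stream of text chunks before tokenization. The following is a minimal sketch of that pattern, not the library's own code; the helper name iter_chunks is hypothetical.

```python
from typing import Iterable, Iterator, Union


def iter_chunks(text_or_gen: Union[str, Iterable[str]]) -> Iterator[str]:
    """Yield text chunks from a single string or an iterable of strings.

    Hypothetical helper, for illustration only.
    """
    if isinstance(text_or_gen, str):
        # A lone string is one chunk; do not iterate it character by character
        yield text_or_gen
    else:
        # Anything else is assumed to be an iterable of strings
        yield from text_or_gen
```

Checking for str first matters because a string is itself iterable; without the check, a plain string would be consumed one character at a time.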
Version 2.0.6
Fixed handling of abbreviations such as m.v. (miðað við) that should not start a new sentence even if the following word is capitalized.
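The idea behind this fix can be sketched with a naive whitespace-based sentence splitter that consults a list of abbreviations that never end a sentence. This is an illustration only and assumes nothing about the library's internals; the names NOT_FINISHERS and split_sentences are hypothetical.

```python
# Hypothetical set of abbreviations that never terminate a sentence
NOT_FINISHERS = {"m.v.", "t.d."}


def split_sentences(text: str) -> list:
    """Split text into sentences at terminal punctuation followed by a capital."""
    sentences = []
    current = []
    words = text.split()
    for i, word in enumerate(words):
        current.append(word)
        following = words[i + 1] if i + 1 < len(words) else None
        # A word can end a sentence only if it is not a listed abbreviation
        can_end = word.endswith((".", "!", "?")) and word not in NOT_FINISHERS
        if can_end and (following is None or following[0].isupper()):
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences
```

With this sketch, the capitalized word after m.v. no longer triggers a sentence break, while an ordinary period before a capitalized word still does.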
Version 2.0.5
- Fixed bug where single uppercase letters were erroneously being recognized as abbreviations, causing prepositions such as 'Í' and 'Á' at the beginning of sentences to be misunderstood in ReynirPackage
- Added several abbreviations
- Silently correct connectors -og and -eða that are affixed to words, splitting them up in the tokenization process; Tösku-og hanskabúðin thus becomes a single token with correct spacing: Tösku- og hanskabúðin
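The connector correction can be approximated with a regular expression that inserts the missing space after the hyphen. This is a rough sketch under the assumption that the connector is always followed by whitespace; the function name fix_connectors is hypothetical and the library's actual logic may differ.

```python
import re

# "-og" or "-eða" glued to the preceding word and followed by whitespace
_CONNECTOR = re.compile(r"(?<=\w)-(og|eða)(?=\s)")


def fix_connectors(text: str) -> str:
    """Rewrite 'Tösku-og' as 'Tösku- og' (and likewise for '-eða')."""
    return _CONNECTOR.sub(r"- \1", text)
```

Text that already has the correct spacing is left unchanged, because the pattern requires the connector to follow the hyphen directly.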
Version 2.0.4
- Added imperfect abbreviations (amk., osfrv.)
- Recognized klukkan hálf tvö as a TOK.TIME
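In Icelandic, hálf tvö ("half two") means half past one, so the hour word names the upcoming hour. The sketch below shows that conversion in isolation; the function name parse_half_hour and the (hour, minute, second) return shape are assumptions made for illustration, not a description of the library's implementation.

```python
# Icelandic neuter number words for the hours 1 to 12
HOURS = {
    "eitt": 1, "tvö": 2, "þrjú": 3, "fjögur": 4, "fimm": 5, "sex": 6,
    "sjö": 7, "átta": 8, "níu": 9, "tíu": 10, "ellefu": 11, "tólf": 12,
}


def parse_half_hour(phrase: str):
    """Return (hour, minute, second) for '[klukkan] hálf <hour>', else None."""
    words = phrase.lower().split()
    if words[:1] == ["klukkan"]:
        words = words[1:]  # the leading 'klukkan' is optional
    if len(words) != 2 or words[0] != "hálf" or words[1] not in HOURS:
        return None
    upcoming = HOURS[words[1]]
    # 'hálf' refers to the half hour before the named hour
    hour = 12 if upcoming == 1 else upcoming - 1
    return (hour, 30, 0)
```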
Version 2.0.3
- Fixed bug in detokenize() where abbreviations, domains and e-mails containing periods were wrongly split
Version 2.0.2
- Spelled-out day ordinals are no longer included as a part of TOK.DATEREL tokens. Thus, þriðji júní is now a TOK.WORD followed by a TOK.DATEREL. 3. júní continues to be parsed as a single TOK.DATEREL.
Version 2.0.1
- Order of abbreviation meanings within the token.val field made deterministic. Abbreviations are listed in the same order in token.val as they appear in the Abbrev.conf file.
- Fixed bug in measurement unit handling
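One common way to get a deterministic ordering like this is to collect meanings in a list as the file is read, rather than in a set. The sketch below assumes a simplified "abbreviation = meaning" line format, which is not the real Abbrev.conf syntax; the function name load_meanings and the sample entries are hypothetical.

```python
def load_meanings(lines):
    """Map each abbreviation to its meanings, preserving file order.

    Assumes a simplified 'abbrev = meaning' format, for illustration only.
    """
    meanings = {}
    for line in lines:
        line = line.strip()
        # Skip blanks, comments and lines without a definition
        if not line or line.startswith("#") or "=" not in line:
            continue
        abbrev, _, meaning = line.partition("=")
        # A list (unlike a set) keeps meanings in the order they were read
        meanings.setdefault(abbrev.strip(), []).append(meaning.strip())
    return meanings
```

Because Python lists and dicts preserve insertion order, repeated runs over the same file always produce the meanings in the same sequence.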
Version 2.0.0
- Added command line tool
- Added split_into_sentences() and detokenize() functions
- Removed convert_telno option
- Splitting of coalesced tokens made more robust
- Added TOK.SSN, TOK.MOLECULE, TOK.USERNAME and TOK.SERIALNUMBER token kinds
- Abbreviations can now have multiple meanings
Version 1.4.1
- Abbreviations of verbs (dags., f., d.) now return the verb stem as the associated word.
- Source code formatting improved.
- Preparations for more fine-grained control of tokenizer behavior via configuration flags.
Version 1.4.0
- Added configuration option parameters to the tokenizer.tokenize() function, controlling the conversion of numbers and telephone numbers to canonical/Icelandic format, and the handling of 'kludgy' ordinals (3ji, 2ja).
- Added several abbreviations.
- Minor performance enhancements.
- Added a number of test cases.