
Releases: mideind/Tokenizer

Version 2.0.7

24 Jun 15:49
  • Added TOK.COMPANY token type
  • Fixed a few abbreviations
  • Renamed parameter text to text_or_gen in functions that accept a string or a string iterator
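
The text_or_gen naming reflects that the same entry point takes either a single string or an iterator of strings. A minimal sketch of how such an argument can be normalized is shown here; iter_chunks is a hypothetical helper written for illustration and is not part of the Tokenizer API.

```python
from typing import Iterable, Iterator, Union


def iter_chunks(text_or_gen: Union[str, Iterable[str]]) -> Iterator[str]:
    """Yield text chunks from either a single string or an iterable of strings.

    Hypothetical helper for illustration only, not the library's code.
    """
    if isinstance(text_or_gen, str):
        # A plain string is treated as one chunk, not iterated character by character
        yield text_or_gen
    else:
        # Anything else is assumed to be an iterable of strings
        yield from text_or_gen
```

Checking for str first matters because a string is itself iterable; without the check, the caller would receive single characters.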

Version 2.0.6

29 May 12:59

Fixed handling of abbreviations such as m.v. (miðað við) that should not start a new sentence even if the following word is capitalized.
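
The rule described above can be sketched as a toy sentence splitter. This is an assumption-laden illustration, not the library's implementation: NON_FINAL_ABBREVS and split_sentences are hypothetical names, and the set contains only the one abbreviation mentioned in this release note.

```python
# Hypothetical set of abbreviations that never end a sentence
NON_FINAL_ABBREVS = {"m.v."}


def split_sentences(text):
    """Split whitespace-separated text into sentences (toy example)."""
    words = text.split()
    sentences = []
    current = []
    for i, word in enumerate(words):
        current.append(word)
        if not word.endswith((".", "?", "!")):
            continue
        if word.lower() in NON_FINAL_ABBREVS:
            # Never break here, even if the next word is capitalized
            continue
        next_word = words[i + 1] if i + 1 < len(words) else None
        if next_word is None or next_word[0].isupper():
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences
```

With this sketch, "Verðið hækkaði m.v. Bandaríkjadal. Það lækkaði síðan." is split into two sentences, and the capitalized word after m.v. does not trigger a break.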

Version 2.0.5

26 Mar 23:26
  • Fixed bug where single uppercase letters were erroneously being recognized as abbreviations, causing prepositions such as 'Í' and 'Á' at the beginning of sentences to be misunderstood in ReynirPackage
  • Added several abbreviations
  • Connectors -og and -eða that are wrongly affixed to the preceding word are now silently corrected during tokenization; Tösku-og hanskabúðin thus becomes a single token with correct spacing: Tösku- og hanskabúðin
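
The connector correction in the last item can be approximated with a regular expression. The sketch below is only an illustration under the assumption that the fix amounts to inserting a space after the hyphen; fix_affixed_connectors and its pattern are hypothetical and do not reflect how the tokenizer does this internally.

```python
import re

# Hypothetical pattern: a word, a hyphen, then "og" or "eða" glued on,
# followed by whitespace (so words merely starting with "og" are left alone)
_AFFIXED_CONNECTOR = re.compile(r"(\w+)-(og|eða)(?=\s)")


def fix_affixed_connectors(text: str) -> str:
    """Insert the missing space between the hyphen and the connector."""
    return _AFFIXED_CONNECTOR.sub(r"\1- \2", text)
```

Text that is already spaced correctly, such as Tösku- og hanskabúðin, does not match the pattern and passes through unchanged.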

Version 2.0.4

17 Feb 16:15
  • Added imperfect abbreviations (amk., osfrv.)
  • klukkan hálf tvö is now recognized as a TOK.TIME token

Version 2.0.3

17 Dec 15:55
  • Fixed bug in detokenize() where abbreviations, domains and e-mails containing periods were wrongly split

Version 2.0.2

11 Dec 14:30
  • Spelled-out day ordinals are no longer included as a part of TOK.DATEREL tokens. Thus, þriðji júní is now a TOK.WORD followed by a TOK.DATEREL. 3. júní continues to be parsed as a single TOK.DATEREL.

Version 2.0.1

09 Dec 15:35
  • The order of abbreviation meanings within the token.val field is now deterministic: meanings are listed in token.val in the same order as they appear in the Abbrev.conf file.
  • Fixed bug in measurement unit handling
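
One way to get the file-order guarantee described in the first item is to collect meanings in lists rather than sets. The sketch below assumes a simplified, hypothetical "abbrev = meaning" line format with placeholder data; it does not reproduce the real Abbrev.conf syntax or the library's loader.

```python
from collections import defaultdict


def load_meanings(lines):
    """Map each abbreviation to its meanings, preserving file order.

    Assumes hypothetical lines of the form "abbrev = meaning".
    """
    meanings = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        abbrev, _, meaning = line.partition("=")
        abbrev, meaning = abbrev.strip(), meaning.strip()
        # A list keeps first-seen order; a set would not
        if meaning not in meanings[abbrev]:
            meanings[abbrev].append(meaning)
    return dict(meanings)
```

Given the placeholder lines "ex. = zeta" followed by "ex. = alpha", the result lists zeta before alpha, matching file order rather than alphabetical order.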

Version 2.0.0

04 Dec 16:18
  • Added command line tool
  • Added split_into_sentences() and detokenize() functions
  • Removed convert_telno option
  • Splitting of coalesced tokens made more robust
  • Added TOK.SSN, TOK.MOLECULE, TOK.USERNAME and TOK.SERIALNUMBER token kinds
  • Abbreviations can now have multiple meanings

Version 1.4.1

22 Oct 17:24
  • Abbreviations of verbs (dags., f., d.) now return the verb stem as the associated word.
  • Source code formatting improved.
  • Preparations for more fine-grained control of tokenizer behavior via configuration flags.

Version 1.4.0

16 Jul 17:17
  • Added configuration option parameters to the tokenizer.tokenize() function, controlling the conversion of numbers and telephone numbers to canonical/Icelandic format, and the handling of 'kludgy' ordinals (3ji, 2ja).
  • Added several abbreviations.
  • Minor performance enhancements.
  • Added a number of test cases.