Releases · mideind/Tokenizer

24 Jun 15:49

vthorsteinsson

2.0.7

18dc777

Version 2.0.7

Added TOK.COMPANY token type
Fixed a few abbreviations
Renamed parameter text to text_or_gen in functions that accept a string or a string iterator
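
The new parameter name signals that these functions take either a single string or an iterator of strings. A minimal sketch of that calling pattern follows; `iter_chunks` and `count_words` are hypothetical helpers written for illustration and are not part of the Tokenizer API.

```python
from typing import Iterable, Iterator, Union


def iter_chunks(text_or_gen: Union[str, Iterable[str]]) -> Iterator[str]:
    """Yield text chunks from either a single string or an iterable of strings.

    Hypothetical helper for illustration; not part of the Tokenizer API.
    """
    if isinstance(text_or_gen, str):
        # A lone string is treated as a single chunk
        yield text_or_gen
    else:
        # Anything else is assumed to be an iterable of strings
        yield from text_or_gen


def count_words(text_or_gen: Union[str, Iterable[str]]) -> int:
    """Count whitespace-separated words across all chunks."""
    return sum(len(chunk.split()) for chunk in iter_chunks(text_or_gen))
```

Checking for `str` first matters because a string is itself iterable; without that check it would be consumed one character at a time.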

29 May 12:59

vthorsteinsson

2.0.6

a6c1ce4

Version 2.0.6

Fixed handling of abbreviations such as m.v. (miðað við) that should not start a new sentence even if the following word is capitalized.
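
The rule described above can be illustrated with a toy sentence splitter. This is only a sketch under stated assumptions: the abbreviation set `NON_FINAL_ABBREVS` and the function `split_sentences` are made up for the example, and the regular expression is far simpler than what the library actually does.

```python
import re
from typing import List

# Hypothetical, tiny abbreviation list for illustration only
NON_FINAL_ABBREVS = {"m.v.", "t.d.", "þ.e."}


def split_sentences(text: str) -> List[str]:
    """Split after '.', '!' or '?' when followed by whitespace and an
    uppercase letter, unless the period belongs to an abbreviation that
    never ends a sentence."""
    sentences = []
    start = 0
    for m in re.finditer(r"[.!?]\s+(?=[A-ZÁÐÉÍÓÚÝÞÆÖ])", text):
        # The last whitespace-delimited word before the candidate break
        preceding = text[start:m.start() + 1].split()
        last_word = preceding[-1] if preceding else ""
        if last_word in NON_FINAL_ABBREVS:
            continue  # e.g. "m.v." does not end the sentence
        sentences.append(text[start:m.start() + 1])
        start = m.end()
    tail = text[start:]
    if tail:
        sentences.append(tail)
    return sentences
```

With this sketch, `split_sentences("Verðið hækkaði m.v. Ísland. Það er dýrt.")` keeps `m.v. Ísland` together in the first sentence even though `Ísland` is capitalized.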

26 Mar 23:26

vthorsteinsson

2.0.5

4711e73

Version 2.0.5

Fixed bug where single uppercase letters were erroneously being recognized as abbreviations, causing prepositions such as 'Í' and 'Á' at the beginning of sentences to be misunderstood in ReynirPackage
Added several abbreviations
Connectors -og and -eða that are affixed to the preceding word are now silently corrected by splitting them off during tokenization; Tösku-og hanskabúðin thus becomes a single token with correct spacing: Tösku- og hanskabúðin
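
The spacing correction for affixed connectors can be approximated with a single regular expression. The function `fix_connectors` below is a hypothetical text-level sketch, not the library's token-level implementation, and it makes no attempt to decide whether the result should be one token or several.

```python
import re

# Match a hyphen glued to the connector "og" or "eða", e.g. "Tösku-og".
# The trailing \b leaves longer words that merely start with "og" alone.
_CONNECTOR = re.compile(r"(\w)-(og|eða)\b")


def fix_connectors(text: str) -> str:
    """Insert the missing space after the hyphen: 'Tösku-og' -> 'Tösku- og'."""
    return _CONNECTOR.sub(r"\1- \2", text)
```

Text that is already spaced correctly contains no match and passes through unchanged.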

17 Feb 16:15

vthorsteinsson

2.0.4

1b6d439

Version 2.0.4

Added imperfect abbreviations (amk., osfrv.)
klukkan hálf tvö is now recognized as a TOK.TIME
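
For readers unfamiliar with the idiom: hálf tvö means half an hour before two o'clock, that is 1:30. The sketch below shows the arithmetic with a hypothetical parser `parse_half_hour`; the word table and the `(hour, minute, second)` return shape are assumptions made for this example rather than a description of the library's internals.

```python
from typing import Optional, Tuple

# Hour words in the neuter form used for clock times (illustrative list)
HOUR_WORDS = {
    "eitt": 1, "tvö": 2, "þrjú": 3, "fjögur": 4, "fimm": 5, "sex": 6,
    "sjö": 7, "átta": 8, "níu": 9, "tíu": 10, "ellefu": 11, "tólf": 12,
}


def parse_half_hour(phrase: str) -> Optional[Tuple[int, int, int]]:
    """Parse 'klukkan hálf <hour>' into (hour, minute, second), or None."""
    words = phrase.lower().split()
    if len(words) != 3 or words[0] != "klukkan" or words[1] != "hálf":
        return None
    hour = HOUR_WORDS.get(words[2])
    if hour is None:
        return None
    # "hálf tvö" is half an hour before two o'clock, i.e. 1:30
    previous = 12 if hour == 1 else hour - 1
    return (previous, 30, 0)
```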

17 Dec 15:55

vthorsteinsson

2.0.3

0f1d0ab

Version 2.0.3

Fixed bug in detokenize() where abbreviations, domains and e-mails containing periods were wrongly split

11 Dec 14:30

vthorsteinsson

2.0.2

9c85e28

Version 2.0.2

Spelled-out day ordinals are no longer included as a part of TOK.DATEREL tokens. Thus, þriðji júní is now a TOK.WORD followed by a TOK.DATEREL. 3. júní continues to be parsed as a single TOK.DATEREL.

09 Dec 15:35

vthorsteinsson

2.0.1

191fa1a

Version 2.0.1

Order of abbreviation meanings within the token.val field made deterministic. Abbreviations are listed in the same order in token.val as they appear in the Abbrev.conf file.
Fixed bug in measurement unit handling
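
One way to get a deterministic, file-ordered list of meanings is to append to a list as lines are read instead of collecting them in an unordered container. The loader below is a sketch that assumes a simplified, hypothetical `abbrev = meaning` line format; the real Abbrev.conf syntax is richer, and `load_meanings` is not a Tokenizer function.

```python
from typing import Dict, Iterable, List


def load_meanings(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Collect meanings per abbreviation, preserving file order.

    Assumes a simplified, hypothetical 'abbrev = meaning' line format.
    """
    meanings: Dict[str, List[str]] = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or "=" not in line:
            continue
        abbrev, meaning = (part.strip() for part in line.split("=", 1))
        # A list (not a set) keeps meanings in the order they were read
        meanings.setdefault(abbrev, []).append(meaning)
    return meanings
```

Because Python dictionaries also preserve insertion order, both the abbreviations and their meanings come back in the order they appear in the input.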

04 Dec 16:18

vthorsteinsson

2.0.0

9700f5b

Version 2.0.0

Added command line tool
Added split_into_sentences() and detokenize() functions
Removed convert_telno option
Splitting of coalesced tokens made more robust
Added TOK.SSN, TOK.MOLECULE, TOK.USERNAME and TOK.SERIALNUMBER token kinds
Abbreviations can now have multiple meanings

22 Oct 17:24

vthorsteinsson

1.4.1

8dda7be

Version 1.4.1

Abbreviations of verbs (dags., f., d.) now return the verb stem as the associated word.
Source code formatting improved.
Preparations for more fine-grained control of tokenizer behavior via configuration flags.

16 Jul 17:17

vthorsteinsson

1.4.0

e7dc127

Version 1.4.0

Added configuration option parameters to the tokenizer.tokenize() function, controlling the conversion of numbers and telephone numbers to canonical/Icelandic format, and the handling of 'kludgy' ordinals (3ji, 2ja).
Added several abbreviations.
Minor performance enhancements.
Added a number of test cases.
