@@ -12,7 +12,7 @@ Tokenization is a necessary first step in many natural language processing
tasks, such as word counting, parsing, spell checking, corpus generation, and
statistical analysis of text.
- **Tokenizer** is a compact pure-Python (2 and 3) executable
+ **Tokenizer** is a compact pure-Python (>= 3.6) executable
program and module for tokenizing Icelandic text. It converts input text to
streams of *tokens*, where each token is a separate word, punctuation sign,
number/amount, date, e-mail, URL/URI, etc. It also segments the token stream
@@ -194,10 +194,6 @@ An example of shallow tokenization from Python code goes something like this:
.. code-block:: python
- from __future__ import print_function
- # The following import is optional but convenient under Python 2.7
- from __future__ import unicode_literals
-
from tokenizer import split_into_sentences
# A string to be tokenized, containing two sentences
@@ -213,12 +209,12 @@ An example of shallow tokenization from Python code goes something like this:
tokens = sentence.split()
- # Print the tokens, comma-separated
- print(", ".join(tokens))
+ # Print the tokens, separated by vertical bars
+ print("| ".join(tokens))
The program outputs::
- 3., janúar, sl., keypti, ég, 64kWst, rafbíl, .
- Hann, kostaði, €30.000, .
+ 3.| janúar| sl.| keypti| ég| 64kWst| rafbíl| .
+ Hann| kostaði| €30.000| .
Deep tokenization example
=========================
@@ -227,8 +223,6 @@ To do deep tokenization from within Python code:
.. code-block:: python
- # The following import is optional but convenient under Python 2.7
- from __future__ import unicode_literals
from tokenizer import tokenize, TOK
text = ("Málinu var vísað til stjórnskipunar- og eftirlitsnefndar "
@@ -312,11 +306,6 @@ Alternatively, create a token list from the returned generator::
token_list = list(tokenizer.tokenize(mystring))
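
Since ``tokenize()`` returns a lazy generator, tokens can also be consumed
incrementally, without materializing the whole list. A minimal sketch using
only the standard library (the sample string is illustrative):

.. code-block:: python

    from itertools import islice
    from tokenizer import tokenize

    mystring = "Hér er stuttur texti."

    # Consume at most the first ten tokens of the stream
    for token in islice(tokenize(mystring), 10):
        # Skip tokens that have no source text (e.g. sentence delimiters)
        if token.txt:
            print(token.txt)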
- In Python 2.7, you can pass either ``unicode`` strings or ``str``
- byte strings to ``tokenizer.tokenize()``. In the latter case, the
- byte string is assumed to be encoded in UTF-8.
-
-
The ``split_into_sentences()`` function
---------------------------------------
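
As shown in the shallow tokenization example above, this function yields each
sentence of the input as a plain string of space-separated token texts. A
minimal usage sketch (the sample string is illustrative):

.. code-block:: python

    from tokenizer import split_into_sentences

    # Print each detected sentence on its own line
    for sentence in split_into_sentences("Þetta er fyrsta setningin. Þetta er önnur."):
        print(sentence)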
@@ -504,14 +493,14 @@ functions:
The token object
----------------
- Each token is represented by a ``namedtuple`` with three fields:
- ``(kind, txt, val)``.
+ Each token is an instance of the class ``Tok`` that has three main properties:
+ ``kind``, ``txt`` and ``val``.

- The ``kind`` field
- ==================
+ The ``kind`` property
+ =====================

- The ``kind`` field contains one of the following integer constants,
+ The ``kind`` property contains one of the following integer constants,
defined within the ``TOK`` class:

+---------------+---------+---------------------+---------------------------+
@@ -627,14 +616,14 @@ To obtain a descriptive text for a token kind, use
``TOK.descr[token.kind]`` (see example above).
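
For instance, a minimal sketch that prints each token's kind description
alongside its text (the sample string is illustrative):

.. code-block:: python

    from tokenizer import tokenize, TOK

    for token in tokenize("Fundurinn hefst kl. 15:30."):
        # Map the integer kind to its descriptive name, e.g. WORD or TIME
        print(TOK.descr[token.kind], token.txt)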
- The ``txt`` field
- ==================
+ The ``txt`` property
+ ====================

- The ``txt`` field contains the original source text for the token,
+ The ``txt`` property contains the original source text for the token,
with the following exceptions:
* All contiguous whitespace (spaces, tabs, newlines) is coalesced
- into single spaces (``" "``) within the ``txt`` field. A date
+ into single spaces (``" "``) within the ``txt`` string. A date
token that is parsed from a source text of ``"29. \n janúar"``
thus has a ``txt`` of ``"29. janúar"``.
@@ -655,10 +644,10 @@ with the following exceptions:
being escaped (``á``).

- The ``val`` field
- ==================
+ The ``val`` property
+ ====================

- The ``val`` field contains auxiliary information, corresponding to
+ The ``val`` property contains auxiliary information, corresponding to
the token kind, as follows:
- For ``TOK.PUNCTUATION``, the ``val`` field contains a tuple with
@@ -676,40 +665,52 @@ the token kind, as follows:
quotes are represented as Icelandic ones (i.e. „these“ or ‚these‘) in
normalized form, and an ellipsis ("...") is represented as the single
character "…".
+
- For ``TOK.TIME``, the ``val`` field contains an
``(hour, minute, second)`` tuple.
+
- For ``TOK.DATEABS``, the ``val`` field contains a
``(year, month, day)`` tuple (all 1-based).
+
- For ``TOK.DATEREL``, the ``val`` field contains a
``(year, month, day)`` tuple (all 1-based),
except that at least one of the tuple fields is missing and set to 0.
Example: *3. júní* becomes ``TOK.DATEREL`` with the fields ``(0, 6, 3)``
as the year is missing.
+
- For ``TOK.YEAR``, the ``val`` field contains the year as an integer.
A negative number indicates that the year is BCE (*fyrir Krist*),
specified with the suffix *f.Kr.* (e.g. *árið 33 f.Kr.*).
+
- For ``TOK.NUMBER``, the ``val`` field contains a tuple
``(number, None, None)``.
(The two empty fields are included for compatibility with Greynir.)
+
- For ``TOK.WORD``, the ``val`` field contains the full expansion
of an abbreviation, as a list containing a single tuple, or ``None``
if the word is not abbreviated.
+
- For ``TOK.PERCENT``, the ``val`` field contains a tuple
of ``(percentage, None, None)``.
+
- For ``TOK.ORDINAL``, the ``val`` field contains the ordinal value
as an integer. The original ordinal may be a decimal number
or a Roman numeral.
+
- For ``TOK.TIMESTAMP``, the ``val`` field contains
a ``(year, month, day, hour, minute, second)`` tuple.
+
- For ``TOK.AMOUNT``, the ``val`` field contains
an ``(amount, currency, None, None)`` tuple. The amount is a float, and
the currency is an ISO currency code, e.g. *USD* for dollars ($ sign),
*EUR* for euros (€ sign) or *ISK* for Icelandic króna
(*kr.* abbreviation). (The two empty fields are included for
compatibility with Greynir.)
+
- For ``TOK.MEASUREMENT``, the ``val`` field contains a ``(unit, value)``
tuple, where ``unit`` is a base SI unit (such as ``g``, ``m``,
``m²``, ``s``, ``W``, ``Hz``, ``K`` for temperature in Kelvin).
+
- For ``TOK.TELNO``, the ``val`` field contains a tuple: ``(number, cc)``
where the first item is the phone number
in a normalized ``NNN-NNNN`` format, i.e. always including a hyphen,
@@ -733,8 +734,8 @@ An example is *o.s.frv.*, which results in a ``val`` field equal to
``[('og svo framvegis', 0, 'ao', 'frasi', 'o.s.frv.', '-')]``.
The tuple format is designed to be compatible with the
- *Database of Modern Icelandic Inflection* (*DMII*),
- *Beygingarlýsing íslensks nútímamáls*.
+ *Database of Icelandic Morphology* (*DIM*),
+ *Beygingarlýsing íslensks nútímamáls*, i.e. the so-called *Sigrúnarsnið*.
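
As an illustration, a short sketch that retrieves such an expansion from the
``val`` field (the sample string is illustrative; the exact tuple contents
depend on the abbreviation dictionary in use):

.. code-block:: python

    from tokenizer import tokenize, TOK

    for token in tokenize("Þetta gerist árlega, mánaðarlega o.s.frv."):
        if token.kind == TOK.WORD and token.val:
            # val is a list holding a single meaning tuple; its first
            # element is the expansion, e.g. 'og svo framvegis'
            print(token.txt, "->", token.val[0][0])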
Development installation
@@ -804,6 +805,8 @@ can be found in the file ``test/toktest_normal_gold_expected.txt``.
Changelog
---------
+ * Version 3.0.0: Added tracking of character offsets for tokens within the
+ original source text. Added full type annotations. Dropped Python 2.7 support.
* Version 2.5.0: Added arguments for all tokenizer options to the
command-line tool. Type annotations enhanced.
* Version 2.4.0: Fixed bug where certain well-known word forms (*fá*, *fær*, *mín*, *sá*...)