
Commit 0e881d7

Merge pull request #20 from mideind/feature/nondestructive-tokenization
Feature/nondestructive tokenization
2 parents 16d5624 + 2ea28ff, commit 0e881d7

16 files changed: +2749 additions, −1336 deletions

.github/workflows/python-package.yml

Lines changed: 1 addition & 1 deletion

@@ -16,7 +16,7 @@ jobs:
     strategy:
       matrix:
         os: [ubuntu-latest]
-        python-version: [2.7, 3.6, 3.7, 3.8, 3.9, pypy-3.6]
+        python-version: [3.6, 3.7, 3.8, 3.9, pypy-3.6]

     steps:
       - uses: actions/checkout@v2

.gitignore

Lines changed: 3 additions & 0 deletions

@@ -114,3 +114,6 @@ ENV/
 # mypy
 .mypy_cache/
 mypy.ini
+
+# Vim swap files
+.*.swp

README.rst

Lines changed: 32 additions & 29 deletions

@@ -12,7 +12,7 @@ Tokenization is a necessary first step in many natural language processing
 tasks, such as word counting, parsing, spell checking, corpus generation, and
 statistical analysis of text.

-**Tokenizer** is a compact pure-Python (2 and 3) executable
+**Tokenizer** is a compact pure-Python (>= 3.6) executable
 program and module for tokenizing Icelandic text. It converts input text to
 streams of *tokens*, where each token is a separate word, punctuation sign,
 number/amount, date, e-mail, URL/URI, etc. It also segments the token stream
@@ -194,10 +194,6 @@ An example of shallow tokenization from Python code goes something like this:

 .. code-block:: python

-    from __future__ import print_function
-    # The following import is optional but convenient under Python 2.7
-    from __future__ import unicode_literals
-
     from tokenizer import split_into_sentences

     # A string to be tokenized, containing two sentences
@@ -213,12 +209,12 @@ An example of shallow tokenization from Python code goes something like this:

         tokens = sentence.split()

         # Print the tokens, comma-separated
-        print(", ".join(tokens))
+        print("|".join(tokens))

 The program outputs::

-    3., janúar, sl., keypti, ég, 64kWst, rafbíl, .
-    Hann, kostaði, €30.000, .
+    3.|janúar|sl.|keypti|ég|64kWst|rafbíl|.
+    Hann|kostaði|€30.000|.

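For readers without the package installed, the formatting change in the example above (comma-separated to pipe-separated output) can be reproduced with a plain-Python stand-in. This is a naive whitespace splitter for illustration only, not the real Icelandic tokenizer, which also handles dates, amounts and abbreviations:

```python
# Minimal stand-in for the updated README example: join whitespace-split
# tokens with "|". This does NOT perform real Icelandic tokenization.
def pipe_join(sentence: str) -> str:
    tokens = sentence.split()
    return "|".join(tokens)

print(pipe_join("Hann kostaði €30.000 ."))  # -> Hann|kostaði|€30.000|.
```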
 Deep tokenization example
 =========================

@@ -227,8 +223,6 @@ To do deep tokenization from within Python code:

 .. code-block:: python

-    # The following import is optional but convenient under Python 2.7
-    from __future__ import unicode_literals
     from tokenizer import tokenize, TOK

     text = ("Málinu var vísað til stjórnskipunar- og eftirlitsnefndar "
@@ -312,11 +306,6 @@ Alternatively, create a token list from the returned generator::

     token_list = list(tokenizer.tokenize(mystring))

-In Python 2.7, you can pass either ``unicode`` strings or ``str``
-byte strings to ``tokenizer.tokenize()``. In the latter case, the
-byte string is assumed to be encoded in UTF-8.
-
 The ``split_into_sentences()`` function
 ---------------------------------------

@@ -504,14 +493,14 @@ functions:

 The token object
 ----------------

-Each token is represented by a ``namedtuple`` with three fields:
-``(kind, txt, val)``.
+Each token is an instance of the class ``Tok`` that has three main properties:
+``kind``, ``txt`` and ``val``.


-The ``kind`` field
-==================
+The ``kind`` property
+=====================

-The ``kind`` field contains one of the following integer constants,
+The ``kind`` property contains one of the following integer constants,
 defined within the ``TOK`` class:

 +---------------+---------+---------------------+---------------------------+
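The namedtuple-to-class change documented above can be sketched as follows. This is an illustrative simplification, not the package's actual ``Tok`` implementation; the ``kind`` value and the ``original`` field shown here are hypothetical, with ``original`` hinting at why a mutable class suits the non-destructive tokenization this commit introduces:

```python
from typing import Any, Optional

class Tok:
    """Simplified sketch of a token object exposing kind, txt and val.

    Unlike a frozen namedtuple, a class can carry extra bookkeeping,
    such as the exact source substring the token was cut from, which
    is what makes non-destructive tokenization possible.
    """

    def __init__(self, kind: int, txt: str, val: Any = None,
                 original: Optional[str] = None) -> None:
        self.kind = kind          # integer constant (from the TOK class)
        self.txt = txt            # normalized token text
        self.val = val            # kind-dependent auxiliary information
        self.original = original  # raw source substring (may differ from txt)

t = Tok(kind=6, txt="29. janúar", val=(0, 1, 29), original="29. \n janúar")
print(t.kind, t.txt)
```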
@@ -627,14 +616,14 @@ To obtain a descriptive text for a token kind, use

 ``TOK.descr[token.kind]`` (see example above).


-The ``txt`` field
-==================
+The ``txt`` property
+====================

-The ``txt`` field contains the original source text for the token,
+The ``txt`` property contains the original source text for the token,
 with the following exceptions:

 * All contiguous whitespace (spaces, tabs, newlines) is coalesced
-  into single spaces (``" "``) within the ``txt`` field. A date
+  into single spaces (``" "``) within the ``txt`` string. A date
   token that is parsed from a source text of ``"29. \n janúar"``
   thus has a ``txt`` of ``"29. janúar"``.

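The whitespace-coalescing rule described for ``txt`` can be mimicked with a one-line regex. This is a sketch of the documented behavior, not the package's own code:

```python
import re

def coalesce_whitespace(s: str) -> str:
    """Collapse runs of spaces, tabs and newlines into single spaces,
    as the txt property is described to do above."""
    return re.sub(r"\s+", " ", s)

print(coalesce_whitespace("29. \n janúar"))  # -> 29. janúar
```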
@@ -655,10 +644,10 @@ with the following exceptions:

   being escaped (``á``).


-The ``val`` field
-==================
+The ``val`` property
+====================

-The ``val`` field contains auxiliary information, corresponding to
+The ``val`` property contains auxiliary information, corresponding to
 the token kind, as follows:

 - For ``TOK.PUNCTUATION``, the ``val`` field contains a tuple with
@@ -676,40 +665,52 @@ the token kind, as follows:

   quotes are represented as Icelandic ones (i.e. „these“ or ‚these‘) in
   normalized form, and ellipsis ("...") are represented as the single
   character "…".
+
 - For ``TOK.TIME``, the ``val`` field contains an
   ``(hour, minute, second)`` tuple.
+
 - For ``TOK.DATEABS``, the ``val`` field contains a
   ``(year, month, day)`` tuple (all 1-based).
+
 - For ``TOK.DATEREL``, the ``val`` field contains a
   ``(year, month, day)`` tuple (all 1-based),
   except that at least one of the tuple fields is missing and set to 0.
   Example: *3. júní* becomes ``TOK.DATEREL`` with the fields ``(0, 6, 3)``
   as the year is missing.
+
 - For ``TOK.YEAR``, the ``val`` field contains the year as an integer.
   A negative number indicates that the year is BCE (*fyrir Krist*),
   specified with the suffix *f.Kr.* (e.g. *árið 33 f.Kr.*).
+
 - For ``TOK.NUMBER``, the ``val`` field contains a tuple
   ``(number, None, None)``.
   (The two empty fields are included for compatibility with Greynir.)
+
 - For ``TOK.WORD``, the ``val`` field contains the full expansion
   of an abbreviation, as a list containing a single tuple, or ``None``
   if the word is not abbreviated.
+
 - For ``TOK.PERCENT``, the ``val`` field contains a tuple
   of ``(percentage, None, None)``.
+
 - For ``TOK.ORDINAL``, the ``val`` field contains the ordinal value
   as an integer. The original ordinal may be a decimal number
   or a Roman numeral.
+
 - For ``TOK.TIMESTAMP``, the ``val`` field contains
   a ``(year, month, day, hour, minute, second)`` tuple.
+
 - For ``TOK.AMOUNT``, the ``val`` field contains
   an ``(amount, currency, None, None)`` tuple. The amount is a float, and
   the currency is an ISO currency code, e.g. *USD* for dollars ($ sign),
   *EUR* for euros (€ sign) or *ISK* for Icelandic króna
   (*kr.* abbreviation). (The two empty fields are included for
   compatibility with Greynir.)
+
 - For ``TOK.MEASUREMENT``, the ``val`` field contains a ``(unit, value)``
   tuple, where ``unit`` is a base SI unit (such as ``g``, ``m``,
   ````, ``s``, ``W``, ``Hz``, ``K`` for temperature in Kelvin).
+
 - For ``TOK.TELNO``, the ``val`` field contains a tuple: ``(number, cc)``
   where the first item is the phone number
   in a normalized ``NNN-NNNN`` format, i.e. always including a hyphen,
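As an illustration of consuming ``val`` by token kind, the tuple layouts listed above can be dispatched on as follows. The integer constants here are hypothetical stand-ins; the real values are defined in the package's ``TOK`` class:

```python
from datetime import date
from typing import Any

# Hypothetical stand-in kind constants; the real integer values are
# defined in the TOK class of the tokenizer package.
TOK_DATEABS = 7
TOK_AMOUNT = 13

def describe(kind: int, txt: str, val: Any) -> str:
    """Render a token's auxiliary val according to its kind,
    following the tuple layouts listed above."""
    if kind == TOK_DATEABS:
        y, m, d = val                 # (year, month, day), all 1-based
        return f"{txt} -> {date(y, m, d).isoformat()}"
    if kind == TOK_AMOUNT:
        amount, currency, _, _ = val  # (amount, currency, None, None)
        return f"{txt} -> {amount:.2f} {currency}"
    return txt

print(describe(TOK_DATEABS, "3. janúar 2021", (2021, 1, 3)))
print(describe(TOK_AMOUNT, "€30.000", (30000.0, "EUR", None, None)))
```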
@@ -733,8 +734,8 @@ An example is *o.s.frv.*, which results in a ``val`` field equal to

 ``[('og svo framvegis', 0, 'ao', 'frasi', 'o.s.frv.', '-')]``.

 The tuple format is designed to be compatible with the
-*Database of Modern Icelandic Inflection* (*DMII*),
-*Beygingarlýsing íslensks nútímamáls*.
+*Database of Icelandic Morphology* (*DIM*),
+*Beygingarlýsing íslensks nútímamáls*, i.e. the so-called *Sigrúnarsnið*.


 Development installation
@@ -804,6 +805,8 @@ can be found in the file ``test/toktest_normal_gold_expected.txt``.

 Changelog
 ---------

+* Version 3.0.0: Added tracking of character offsets for tokens within the
+  original source text. Added full type annotations. Dropped Python 2.7 support.
 * Version 2.5.0: Added arguments for all tokenizer options to the
   command-line tool. Type annotations enhanced.
 * Version 2.4.0: Fixed bug where certain well-known word forms (**, *fær*, *mín*, **...)
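The headline feature of version 3.0.0, tracking character offsets of tokens within the original source text, can be illustrated independently of the package. The sketch below uses a naive whitespace tokenizer purely to show the idea: each token carries ``(start, end)`` offsets, so the source text can always be recovered losslessly:

```python
import re
from typing import Iterator, Tuple

def tokens_with_offsets(text: str) -> Iterator[Tuple[str, int, int]]:
    """Yield (token, start, end) character offsets into the original
    text. A naive whitespace tokenizer, used only to illustrate the
    offset-tracking idea; the package's real tokenizer is far richer."""
    for m in re.finditer(r"\S+", text):
        yield m.group(), m.start(), m.end()

text = "Hann  kostaði €30.000"
for tok, start, end in tokens_with_offsets(text):
    assert text[start:end] == tok  # offsets point back into the source
    print(tok, start, end)
```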

setup.py

Lines changed: 8 additions & 12 deletions

@@ -1,5 +1,4 @@
 #!/usr/bin/env python
-# -*- encoding: utf-8 -*-
 """

     Tokenizer for Icelandic text
@@ -30,16 +29,17 @@

 """

-from __future__ import absolute_import
+from typing import Any

 import io
 import re
+
 from glob import glob
 from os.path import basename, dirname, join, splitext
 from setuptools import find_packages, setup  # type: ignore


-def read(*names, **kwargs):
+def read(*names: str, **kwargs: Any) -> str:
     try:
         return io.open(
             join(dirname(__file__), *names),
@@ -48,13 +48,16 @@ def read(*names, **kwargs):

     except (IOError, OSError):
         return ""

+# Load version string from file
+__version__ = "[missing]"
+exec(open(join("src", "tokenizer", "version.py")).read())

 setup(
     name="tokenizer",
-    version="2.5.0",  # Also update src/tokenizer/__init__.py
+    version=__version__,
     license="MIT",
     description="A tokenizer for Icelandic text",
-    long_description=u"{0}\n{1}".format(
+    long_description="{0}\n{1}".format(
         re.compile("^.. start-badges.*^.. end-badges", re.M | re.S)
         .sub("", read("README.rst")
         ),
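The single-source version pattern introduced in this hunk, defining ``__version__`` in one small file and executing it from setup.py, can be sketched on its own. The file name and version string below are illustrative:

```python
import os
import tempfile

def load_version(path: str) -> str:
    """Execute a version.py file and return the __version__ it defines,
    mirroring the exec() mechanism in the setup.py diff above."""
    namespace = {"__version__": "[missing]"}  # fallback if exec defines nothing
    with open(path) as f:
        exec(f.read(), namespace)
    return namespace["__version__"]

# Demonstrate with a throwaway version.py (illustrative contents)
with tempfile.TemporaryDirectory() as tmp:
    version_path = os.path.join(tmp, "version.py")
    with open(version_path, "w") as f:
        f.write('__version__ = "3.0.0"\n')
    print(load_version(version_path))  # -> 3.0.0
```

The benefit of this design is that the version string no longer needs to be updated in two places (previously both setup.py and ``src/tokenizer/__init__.py``).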
@@ -79,9 +82,7 @@ def read(*names, **kwargs):

         "Operating System :: Microsoft :: Windows",
         "Natural Language :: Icelandic",
         "Programming Language :: Python",
-        "Programming Language :: Python :: 2.7",
         "Programming Language :: Python :: 3",
-        "Programming Language :: Python :: 3.5",
         "Programming Language :: Python :: 3.6",
         "Programming Language :: Python :: 3.7",
         "Programming Language :: Python :: 3.8",
@@ -93,11 +94,6 @@ def read(*names, **kwargs):

         "Topic :: Text Processing :: Linguistic",
     ],
     keywords=["nlp", "tokenizer", "icelandic"],
-    # Install the typing module if it isn't already in the
-    # Python standard library (i.e. in versions prior to 3.5)
-    install_requires=[
-        "typing;python_version<'3.5'"
-    ],
     # Set up a tokenize command (tokenize.exe on Windows),
     # which calls main() in src/tokenizer/main.py
     entry_points={

src/tokenizer/__init__.py

Lines changed: 7 additions & 7 deletions

@@ -1,4 +1,3 @@
-# -*- encoding: utf-8 -*-
 """

     Copyright (C) 2021 Miðeind ehf.
@@ -27,19 +26,20 @@

 """

-from __future__ import absolute_import
-
 from .definitions import (
     TP_LEFT, TP_CENTER, TP_RIGHT, TP_NONE, TP_WORD,
     EN_DASH, EM_DASH,
-    KLUDGY_ORDINALS_PASS_THROUGH, KLUDGY_ORDINALS_MODIFY, KLUDGY_ORDINALS_TRANSLATE
+    KLUDGY_ORDINALS_PASS_THROUGH, KLUDGY_ORDINALS_MODIFY, KLUDGY_ORDINALS_TRANSLATE,
+    BIN_Tuple, BIN_TupleList
 )
 from .tokenizer import (
     TOK, Tok, tokenize, tokenize_without_annotation, split_into_sentences,
     parse_tokens, correct_spaces, detokenize, mark_paragraphs, paragraphs,
-    normalized_text, normalized_text_from_tokens, text_from_tokens
+    normalized_text, normalized_text_from_tokens, text_from_tokens,
+    calculate_indexes, generate_rough_tokens
 )
 from .abbrev import Abbreviations, ConfigError
+from .version import __version__

-__author__ = u"Miðeind ehf"
-__version__ = u"2.5.0"  # Also update setup.py
+__author__ = "Miðeind ehf"
+__copyright__ = "(C) 2021 Miðeind ehf."
