@@ -51,8 +51,8 @@ two tokens each, output with a single space between them.
 
 In deep tokenization, the same strings are represented by single token objects,
 of type ``TOK.MEASUREMENT``, ``TOK.DATEREL`` and ``TOK.TELNO``, respectively.
-The text associated with a single token object may contain one or more spaces,
-although consecutive space is always coalesced.
+The text associated with a single token object may contain spaces,
+although consecutive whitespace is always coalesced into a single space ``" "``.
 
 By default, the command line tool performs shallow tokenization. If you
 want deep tokenization with the command line tool, use the ``--json`` or
@@ -83,11 +83,11 @@ the command line:
 
 $ tokenize input.txt output.txt
 
-Input and output files are encoded in UTF-8. If the files are not
+Input and output files are in UTF-8 encoding. If the files are not
 given explicitly, ``stdin`` and ``stdout`` are used for input and output,
 respectively.
 
-Empty lines in the input are treated as sentence boundaries.
+Empty lines in the input are treated as hard sentence boundaries.
 
 By default, the output consists of one sentence per line, where each
 line ends with a single newline character (ASCII LF, ``chr(10)``, ``"\n"``).
@@ -109,28 +109,27 @@ Other options can be specified on the command line:
 
 +-----------------------------------+---------------------------------------------------+
 | | ``-n``                          | Normalize punctuation, causing e.g. quotes to be  |
-| | ``--normalize``                 | output in Icelandic form and hyphens to be        |
-|                                   | regularized. This option is only applicable to    |
+| |                                 | output in Icelandic form and hyphens to be        |
+| | ``--normalize``                 | regularized. This option is only applicable to    |
 |                                   | shallow tokenization.                             |
 +-----------------------------------+---------------------------------------------------+
-| | ``-s``                          | Input contains strictly one sentence per line.    |
+| | ``-s``                          | Input contains strictly one sentence per line,    |
+| |                                 | i.e. every newline is a sentence boundary.        |
 | | ``--one_sent_per_line``         |                                                   |
 +-----------------------------------+---------------------------------------------------+
 | | ``-m``                          | Degree signal in tokens denoting temperature      |
 | | ``--convert_measurements``      | normalized (200° C -> 200 °C)                     |
 +-----------------------------------+---------------------------------------------------+
-| | ``-a``                          | Additional annotation, usually handled by         |
-| | ``--with_annotation``           | GreynirPackage, added to tokens.                  |
-+-----------------------------------+---------------------------------------------------+
 | | ``-p``                          | Numbers combined into one token with the          |
 | | ``--coalesce_percent``          | following token denoting percentage word forms    |
-|                                   | (prósent, prósentustig, hundraðshlutar)           |
+|                                   | (*prósent*, *prósentustig*, *hundraðshlutar*)     |
 +-----------------------------------+---------------------------------------------------+
-| | ``-g``                          | Composite glyphs not replaced with a single       |
-| | ``--keep_composite_glyphs``     | code point, so a ́' is not replaced with á        |
+| | ``-g``                          | Do not replace composite glyphs using Unicode     |
+| | ``--keep_composite_glyphs``     | COMBINING codes with their accented/umlaut        |
+|                                   | counterparts                                      |
 +-----------------------------------+---------------------------------------------------+
-| | ``-e``                          | HTML escape codes replaced,                       |
-| | ``--replace_html_escapes``      | such as '&aacute;' -> 'á'                         |
+| | ``-e``                          | HTML escape codes replaced by their meaning,      |
+| | ``--replace_html_escapes``      | such as ``&aacute;`` -> ``á``                     |
 +-----------------------------------+---------------------------------------------------+
 | | ``-c``                          | English-style decimal points and thousands        |
 | | ``--convert_numbers``           | separators in numbers changed to Icelandic style  |
@@ -142,7 +141,6 @@ Other options can be specified on the command line:
 +-----------------------------------+---------------------------------------------------+
 
 
-
 Type ``tokenize -h`` or ``tokenize --help`` to get a short help message.
 
 Example
@@ -806,6 +804,8 @@ can be found in the file ``test/toktest_normal_gold_expected.txt``.
 Changelog
 ---------
 
+* Version 2.5.0: Added arguments for all tokenizer options to the
+command-line tool. Type annotations enhanced.
 * Version 2.4.0: Fixed bug where certain well-known word forms (*fá*, *fær*, *mín*, *sá*...)
 were being interpreted as (wrong) abbreviations. Also fixed bug where certain
 abbreviations were being recognized even in uppercase and at the end
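Note on the first hunk: the reworded sentence says that consecutive whitespace inside a token's text is coalesced into a single space. The sketch below illustrates that behavior in plain Python. The helper name `coalesce_whitespace` is hypothetical and is not part of the Tokenizer package, whose own implementation may differ.

```python
import re


def coalesce_whitespace(text: str) -> str:
    """Replace every run of whitespace with one space (illustrative helper only)."""
    # \s+ matches one or more spaces, tabs or newlines in a row.
    return re.sub(r"\s+", " ", text)


if __name__ == "__main__":
    # Three spaces between the number and the unit collapse to one.
    print(coalesce_whitespace("690   MW"))  # 690 MW
```

Text that contains only single spaces passes through unchanged, which matches the README's statement that a token's text may still contain spaces.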