Commit bed46a2

Update README.rst
1 parent 52cc1e6 commit bed46a2

1 file changed: +16 -16 lines changed
README.rst

Lines changed: 16 additions & 16 deletions
@@ -51,8 +51,8 @@ two tokens each, output with a single space between them.
 
 In deep tokenization, the same strings are represented by single token objects,
 of type ``TOK.MEASUREMENT``, ``TOK.DATEREL`` and ``TOK.TELNO``, respectively.
-The text associated with a single token object may contain one or more spaces,
-although consecutive space is always coalesced.
+The text associated with a single token object may contain spaces,
+although consecutive whitespace is always coalesced into a single space ``" "``.
 
 By default, the command line tool performs shallow tokenization. If you
 want deep tokenization with the command line tool, use the ``--json`` or
@@ -83,11 +83,11 @@ the command line:
 
 $ tokenize input.txt output.txt
 
-Input and output files are encoded in UTF-8. If the files are not
+Input and output files are in UTF-8 encoding. If the files are not
 given explicitly, ``stdin`` and ``stdout`` are used for input and output,
 respectively.
 
-Empty lines in the input are treated as sentence boundaries.
+Empty lines in the input are treated as hard sentence boundaries.
 
 By default, the output consists of one sentence per line, where each
 line ends with a single newline character (ASCII LF, ``chr(10)``, ``"\n"``).
@@ -109,28 +109,27 @@ Other options can be specified on the command line:
 
 +-----------------------------------+---------------------------------------------------+
 | | ``-n`` | Normalize punctuation, causing e.g. quotes to be |
-| | ``--normalize`` | output in Icelandic form and hyphens to be |
-| | regularized. This option is only applicable to |
+| | | output in Icelandic form and hyphens to be |
+| | ``--normalize`` | regularized. This option is only applicable to |
 | | shallow tokenization. |
 +-----------------------------------+---------------------------------------------------+
-| | ``-s`` | Input contains strictly one sentence per line. |
+| | ``-s`` | Input contains strictly one sentence per line, |
+| | | i.e. every newline is a sentence boundary. |
 | | ``--one_sent_per_line`` | |
 +-----------------------------------+---------------------------------------------------+
 | | ``-m`` | Degree signal in tokens denoting temperature |
 | | ``--convert_measurements`` | normalized (200° C -> 200 °C) |
 +-----------------------------------+---------------------------------------------------+
-| | ``-a`` | Additional annotation, usually handled by |
-| | ``--with_annotation`` | GreynirPackage, added to tokens. |
-+-----------------------------------+---------------------------------------------------+
 | | ``-p`` | Numbers combined into one token with the |
 | | ``--coalesce_percent`` | following token denoting percentage word forms |
-| | (prósent, prósentustig, hundraðshlutar) |
+| | (*prósent*, *prósentustig*, *hundraðshlutar*) |
 +-----------------------------------+---------------------------------------------------+
-| | ``-g`` | Composite glyphs not replaced with a single |
-| | ``--keep_composite_glyphs`` | code point, so a ́' is not replaced with á |
+| | ``-g`` | Do not replace composite glyphs using Unicode |
+| | ``--keep_composite_glyphs`` | COMBINING codes with their accented/umlaut |
+| | counterparts |
 +-----------------------------------+---------------------------------------------------+
-| | ``-e`` | HTML escape codes replaced, |
-| | ``--replace_html_escapes`` | such as '&aacute;' -> 'á' |
+| | ``-e`` | HTML escape codes replaced by their meaning, |
+| | ``--replace_html_escapes`` | such as ``&aacute;`` -> ``á`` |
 +-----------------------------------+---------------------------------------------------+
 | | ``-c`` | English-style decimal points and thousands |
 | | ``--convert_numbers`` | separators in numbers changed to Icelandic style |
@@ -142,7 +141,6 @@ Other options can be specified on the command line:
 +-----------------------------------+---------------------------------------------------+
 
 
-
 Type ``tokenize -h`` or ``tokenize --help`` to get a short help message.
 
 Example
@@ -806,6 +804,8 @@ can be found in the file ``test/toktest_normal_gold_expected.txt``.
 Changelog
 ---------
 
+* Version 2.5.0: Added arguments for all tokenizer options to the
+command-line tool. Type annotations enhanced.
 * Version 2.4.0: Fixed bug where certain well-known word forms (**, *fær*, *mín*, **...)
 were being interpreted as (wrong) abbreviations. Also fixed bug where certain
 abbreviations were being recognized even in uppercase and at the end
