Releases: mideind/Tokenizer

Version 3.1.2

02 Jun 17:04
  • Changed paragraph markers to be [[ and ]], i.e. without spaces, for better accuracy in character offset calculations.
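A minimal sketch of why this matters (illustrative only, not the library's code): the markers `[[` and `]]` are now exactly two characters wide with no surrounding spaces, so character offsets into the marked-up text can be computed with fixed arithmetic.

```python
# Illustrative sketch: whitespace-free paragraph markers have a fixed
# two-character width, keeping offset arithmetic exact.
MARK_BEGIN, MARK_END = "[[", "]]"

marked = "[[Fyrsta málsgrein.]][[Önnur málsgrein.]]"
# The first paragraph's text starts immediately after the 2-char marker:
start = len(MARK_BEGIN)
end = marked.index(MARK_END)
assert marked[start:end] == "Fyrsta málsgrein."
```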

Version 3.1.1

10 May 14:38
  • Minor fix: added Tok.from_token().

Version 3.1.0

29 Apr 10:24
  • Added -o switch to tokenize command to return original token text, enabling the tokenizer to run as a sentence splitter only.

Version 3.0.0

09 Apr 16:01
0e881d7
  • Added tracking of character offsets for tokens within the original source text.
  • Added full type annotations.
  • Dropped Python 2.7 support. Tokenizer now supports Python >= 3.6.
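What offset tracking makes possible can be sketched as follows (an illustrative stand-alone function, not the library's implementation): each token carries a (start, end) span that maps it back to the original source text.

```python
import re

def tokens_with_offsets(text):
    # Yield (token, start, end) triples; start and end index into the
    # original text, so every token can be mapped back to its source span.
    for m in re.finditer(r"\S+", text):
        yield m.group(), m.start(), m.end()

source = "Halló  heimur!"
for tok, start, end in tokens_with_offsets(source):
    # Each span reproduces the token verbatim from the source,
    # even across irregular whitespace:
    assert source[start:end] == tok
```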

Version 2.5.0

08 Mar 11:45
bed46a2
  • Added command-line arguments to the tokenizer executable, corresponding to the available tokenization options.
  • Updated and enhanced type annotations.
  • Minor documentation edits.

Version 2.4.0

08 Oct 12:02
  • Fixed a bug where certain well-known word forms (fær, mín, ...) were being misinterpreted as abbreviations.
  • Fixed a bug where certain abbreviations, for instance Örn, were recognized even in uppercase and at the end of a sentence.

Version 2.3.1

21 Sep 12:03
  • Various bug fixes.
  • Fixed type annotations for Python 2.7.
  • Renamed the token kind NUMBER WITH LETTER to NUMWLETTER.

Version 2.3.0

03 Sep 17:49
  • Added the replace_html_escapes option to the tokenize() function.

Version 2.2.0

20 Aug 22:16
  • Fixed correct_spaces() to handle compounds such as Atvinnu-, nýsköpunar- og ferðamálaráðuneytið and bensínstöðvar, -dælur og -tankar.
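The compound case can be illustrated with a much-simplified spacing sketch (a hypothetical helper; the real correct_spaces() covers many more cases than this):

```python
def join_tokens(tokens):
    # Simplified spacing rule: no space before closing punctuation,
    # a single space between all other tokens. Hyphen-final compound
    # prefixes ("Atvinnu-") and hyphen-initial suffixes ("-dælur")
    # need no special casing here: the hyphen stays inside its token.
    no_space_before = {",", ".", ";", ":", "!", "?"}
    out = []
    for tok in tokens:
        if out and tok not in no_space_before:
            out.append(" ")
        out.append(tok)
    return "".join(out)

print(join_tokens(["Atvinnu-", ",", "nýsköpunar-", "og", "ferðamálaráðuneytið"]))
# Atvinnu-, nýsköpunar- og ferðamálaráðuneytið
```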

Version 2.1.0

02 Jul 16:05
69a4443
  • Changed the handling of periods at the end of a sentence when they are part of an abbreviation: the period is now kept attached to the abbreviation instead of being split off into a separate period token, as before.
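The new behaviour can be sketched as follows (a hypothetical helper and a tiny illustrative abbreviation set, not the library's API):

```python
# A sentence-final period that is part of a known abbreviation stays
# attached to it; an ordinary final period becomes a separate token.
ABBREVIATIONS = {"o.fl.", "o.s.frv.", "t.d."}  # tiny illustrative set

def split_final_period(word):
    if word in ABBREVIATIONS:
        return [word]              # period kept attached, as of 2.1.0
    if word.endswith("."):
        return [word[:-1], "."]    # split off the sentence-final period
    return [word]

assert split_final_period("o.fl.") == ["o.fl."]
assert split_final_period("heimur.") == ["heimur", "."]
```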