This README describes a GPT-2 style tokenizer implemented with the Byte Pair Encoding (BPE) algorithm. The tokenizer operates on UTF-8 encoded text, and the table below summarises how code points in each Unicode range are laid out as bytes.
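Before the byte-level details, a brief illustration of what BPE training does: it repeatedly finds the most frequent adjacent pair of token IDs and merges it into a new token. The sketch below is illustrative only; none of these function or variable names come from this repository.

```python
from collections import Counter

def most_frequent_pair(ids):
    """Return the most common adjacent pair of token IDs."""
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with the single ID `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# Start from raw UTF-8 bytes (a base vocabulary of 256 IDs) and apply one merge.
ids = list("hello hello".encode("utf-8"))
pair = most_frequent_pair(ids)  # e.g. (104, 101), the bytes of "he"
ids = merge(ids, pair, 256)     # 256 is the first newly minted token ID
```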
| Unicode code point (decimal) | No. of bytes | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
|---|---|---|---|---|---|
| 0-127 | 1 | 0-127 | | | |
| 128-2047 | 2 | 192-223 | 128-191 | | |
| 2048-65535 | 3 | 224-239 | 128-191 | 128-191 | |
| 65536-1114111 | 4 | 240-247 | 128-191 | 128-191 | 128-191 |
- Byte values in the table are shown in decimal.
- English (ASCII) characters: 1 byte, i.e. 1 token before any merges.
- Hindi (Devanagari) characters: 3 bytes, i.e. 3 tokens before any merges (see the example below).
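This byte layout can be checked directly with Python's built-in UTF-8 encoder (the sample characters are illustrative):

```python
# ASCII character: 1 byte, matching the 1-byte row of the table
print(list("a".encode("utf-8")))  # [97]

# Devanagari (Hindi) character: 3 bytes, matching the 3-byte row
print(list("क".encode("utf-8")))  # [224, 164, 149]
```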
- Initialise the path variables in paths.py:
```python
DIRECTORY = "/path/to/main/directory"
SOURCE = "/source/folder/containing/all/files"
```
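The remaining scripts can then pick these values up with a normal import (assumed usage, not shown in this README):

```python
from paths import DIRECTORY, SOURCE
```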
Note: The specified directory can contain .txt files or subdirectories; both will be processed accordingly.
- The tokenizer trains on raw .txt files. Call generate_corpus (defined in corpus.py) to build the training corpus:
```python
generate_corpus(DIRECTORY)
```
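generate_corpus itself is not shown in this README; below is a plausible sketch, assuming it walks DIRECTORY recursively and gathers the text of every .txt file (the traversal and return value are assumptions, not the repository's actual code):

```python
import os

def generate_corpus(directory):
    """Gather text from every .txt file under `directory`, recursing into subdirectories."""
    texts = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            if name.endswith(".txt"):
                with open(os.path.join(root, name), encoding="utf-8") as f:
                    texts.append(f.read())
    return "\n".join(texts)
```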
- To train the tokenizer with the required vocabulary size:
```python
tok = Tokenizer()
tok.train(SOURCE, Vocab_size)
```
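Once trained, a GPT-2 style tokenizer is normally used through encode/decode round trips. Assuming this Tokenizer follows that convention (the method names below are an assumption, not confirmed by this README):

```python
ids = tok.encode("नमस्ते world")  # text -> list of token IDs
text = tok.decode(ids)            # token IDs -> text
assert text == "नमस्ते world"     # lossless round trip
```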