BPE tokenizer

This README describes a GPT-2 style tokenizer implemented with the Byte Pair Encoding (BPE) algorithm. The tokenizer operates on UTF-8 encoded text and accounts for how characters in different Unicode ranges map to one or more bytes.
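Training follows the standard byte-level BPE recipe: the text is first turned into raw UTF-8 bytes, then the most frequent adjacent pair of token ids is repeatedly replaced by a new id until the target vocabulary size is reached. The snippet below is a minimal sketch of a single merge step (an illustration only, not the repository's code), assuming ids 0-255 are reserved for the raw bytes:

    from collections import Counter

    def most_frequent_pair(ids):
        # Count every adjacent pair of token ids and return the most common one.
        return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

    def merge(ids, pair, new_id):
        # Replace each occurrence of `pair` with the single token id `new_id`.
        out, i = [], 0
        while i < len(ids):
            if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        return out

    ids = list("aaabdaaabac".encode("utf-8"))   # start from raw UTF-8 bytes
    pair = most_frequent_pair(ids)              # e.g. (97, 97) for "aa"
    ids = merge(ids, pair, 256)                 # first new id after the 256 byte ids
    print(ids)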


Valid UTF-8 Byte Sequences

Unicode Index     No. of Bytes   Byte 1      Byte 2      Byte 3      Byte 4
0-127             1              (0-127)     -           -           -
128-2047          2              (192-223)   (128-191)   -           -
2048-65535        3              (224-239)   (128-191)   (128-191)   -
65536 and above   4              (240-247)   (128-191)   (128-191)   (128-191)

  • Byte values are given in decimal.
  • English characters: 1 token
  • Hindi characters: 3 tokens
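
As a quick check of the byte counts above (illustration only, not part of the repository), Python's built-in str.encode shows how many byte-level tokens a character occupies before any merges:

    # Characters from the 1-, 2-, and 3-byte UTF-8 ranges respectively.
    for ch in ["a", "é", "म"]:
        b = ch.encode("utf-8")
        print(f"{ch!r}: bytes={list(b)} -> {len(b)} token(s)")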

Usage

  1. Initialise the variables in paths.py:
       DIRECTORY: /path/to/main/directory
       SOURCE: /source/folder/containing/all/files

     Note: The specified directory can contain .txt files or subdirectories; both are processed accordingly.

  2. The tokenizer trains on raw .txt files. Run the following (defined in corpus.py) to generate the training corpus:
       generate_corpus(DIRECTORY)

  3. Train the tokenizer with the required vocabulary size (a consolidated sketch follows the list):
       tok = Tokenizer()
       tok.train(SOURCE, Vocab_size)
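
The three steps above can be combined into a single script. The import paths below are assumptions based on the file names mentioned in this README (paths.py, corpus.py) and the Tokenizer class; adjust them if the actual module layout differs:

    # Hypothetical end-to-end run; module names assumed from the file names above.
    from paths import DIRECTORY, SOURCE   # set these in paths.py first (step 1)
    from corpus import generate_corpus    # provided by corpus.py (step 2)
    from tokenizer import Tokenizer       # tokenizer class used in step 3

    generate_corpus(DIRECTORY)            # build the training corpus from raw .txt files
    tok = Tokenizer()
    tok.train(SOURCE, 1000)               # Vocab_size: 256 byte ids plus learned merges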
