EBook Added to PreTraining
AllianceSoftech committed Dec 8, 2024
1 parent f8c9f35 commit 136817a
Showing 2 changed files with 5,440 additions and 8 deletions.
25 changes: 17 additions & 8 deletions part_1/01_main-code/part_1.ipynb
@@ -18,10 +18,19 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 1,
"id": "4d1305cf-12d5-46fe-a2c9-36fb71c5b3d3",
"metadata": {},
"outputs": [],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"torch version: 2.4.0\n",
"tiktoken version: 0.8.0\n"
]
}
],
"source": [
"from importlib.metadata import version\n",
"\n",
@@ -116,7 +125,7 @@
"metadata": {},
"source": [
"- Load raw text we want to work with\n",
"- [The Verdict by Edith Wharton](https://en.wikisource.org/wiki/The_Verdict) is a public domain short story"
"- [The Wonderful Wizard of Oz](https://www.gutenberg.org/cache/epub/55/pg55.txt) from Project Gutenberg. Project Gutenberg is a library of over 70,000 free eBooks"
]
},
{
@@ -130,8 +139,8 @@
"import urllib.request\n",
"\n",
"if not os.path.exists(\"the-verdict.txt\"):\n",
" url = (\"https://raw.githubusercontent.com/Sangwan70/Building-an-LLM-From-Scratch/refs/heads/main/part_1/01_main-code/the-verdict.txt\")\n",
" file_path = \"the-verdict.txt\"\n",
" url = (\"https://raw.githubusercontent.com/Sangwan70/Building-an-LLM-From-Scratch/refs/heads/main/part_1/01_main-code/wizard_of_oz.txt\")\n",
" file_path = \"wizard_of_oz.txt\"\n",
" urllib.request.urlretrieve(url, file_path)"
]
},
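As a self-contained sketch of this download step, assuming the existence check is meant to target the same wizard_of_oz.txt file that is written:

    import os
    import urllib.request

    file_path = "wizard_of_oz.txt"
    url = ("https://raw.githubusercontent.com/Sangwan70/Building-an-LLM-From-Scratch/"
           "refs/heads/main/part_1/01_main-code/wizard_of_oz.txt")

    # Download the e-book only if it is not already present locally.
    if not os.path.exists(file_path):
        urllib.request.urlretrieve(url, file_path)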
@@ -523,7 +532,7 @@
" - `[PAD]` (padding) if we train LLMs with a batch size greater than 1 (we may include multiple texts with different lengths; with the padding token we pad the shorter texts to the longest length so that all texts have an equal length)\n",
"- `[UNK]` to represent words that are not included in the vocabulary\n",
"\n",
"- Note that GPT-2 does not need any of these tokens mentioned above but only uses an `<|endoftext|>` token to reduce complexity\n",
"- GPT-2 does not need any of these tokens mentioned above but only uses an `<|endoftext|>` token to reduce complexity\n",
"- The `<|endoftext|>` is analogous to the `[EOS]` token mentioned above\n",
"- GPT also uses the `<|endoftext|>` for padding (since we typically use a mask when training on batched inputs, we would not attend padded tokens anyways, so it does not matter what these tokens are)\n",
"- GPT-2 does not use an `<UNK>` token for out-of-vocabulary words; instead, GPT-2 uses a byte-pair encoding (BPE) tokenizer, which breaks down words into subword units which we will discuss in a later section\n",
@@ -712,7 +721,7 @@
"- it allows the model to break down words that aren't in its predefined vocabulary into smaller subword units or even individual characters, enabling it to handle out-of-vocabulary words\n",
"- For instance, if GPT-2's vocabulary doesn't have the word \"unfamiliarword,\" it might tokenize it as [\"unfam\", \"iliar\", \"word\"] or some other subword breakdown, depending on its trained BPE merges\n",
"- The original BPE tokenizer can be found here: [https://github.com/openai/gpt-2/blob/master/src/encoder.py](https://github.com/openai/gpt-2/blob/master/src/encoder.py)\n",
"- In this chapter, we are using the BPE tokenizer from OpenAI's open-source [tiktoken](https://github.com/openai/tiktoken) library, which implements its core algorithms in Rust to improve computational performance\n",
"- In this lab, we are using the BPE tokenizer from open-source [tiktoken](https://github.com/openai/tiktoken) library, which implements its core algorithms in Rust to improve computational performance\n",
"- I created a notebook in the [./bytepair_encoder](../02_bonus_bytepair-encoder) that compares these two implementations side-by-side (tiktoken was about 5x faster on the sample text)"
]
},
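To see the subword behaviour described above, one can encode an out-of-vocabulary word and decode each resulting id individually; the exact split depends on GPT-2's trained BPE merges:

    import tiktoken

    tokenizer = tiktoken.get_encoding("gpt2")

    # An invented word that is unlikely to appear in GPT-2's 50,257-entry vocabulary.
    ids = tokenizer.encode("unfamiliarword")

    # Decoding each id on its own reveals the subword pieces.
    print(ids)
    print([tokenizer.decode([i]) for i in ids])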
@@ -825,7 +834,7 @@
"metadata": {},
"outputs": [],
"source": [
"with open(\"the-verdict.txt\", \"r\", encoding=\"utf-8\") as f:\n",
"with open(\"wizard_of_oz.txt\", \"r\", encoding=\"utf-8\") as f:\n",
" raw_text = f.read()\n",
"\n",
"enc_text = tokenizer.encode(raw_text)\n",