EBook Added to PreTraining
AllianceSoftech committed Dec 8, 2024
1 parent f8c9f35 commit 136817a
Showing 2 changed files with 5,440 additions and 8 deletions.
25 changes: 17 additions & 8 deletions part_1/01_main-code/part_1.ipynb
@@ -18,10 +18,19 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 1,
"id": "4d1305cf-12d5-46fe-a2c9-36fb71c5b3d3",
"metadata": {},
"outputs": [],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"torch version: 2.4.0\n",
"tiktoken version: 0.8.0\n"
]
}
],
"source": [
"from importlib.metadata import version\n",
"\n",
@@ -116,7 +125,7 @@
"metadata": {},
"source": [
"- Load raw text we want to work with\n",
"- [The Verdict by Edith Wharton](https://en.wikisource.org/wiki/The_Verdict) is a public domain short story"
"- [The Wonderful Wizard of Oz](https://www.gutenberg.org/cache/epub/55/pg55.txt) from Project Gutenberg. Project Gutenberg is a library of over 70,000 free eBooks"
]
},
{
@@ -130,8 +139,8 @@
"import urllib.request\n",
"\n",
"if not os.path.exists(\"the-verdict.txt\"):\n",
" url = (\"https://raw.githubusercontent.com/Sangwan70/Building-an-LLM-From-Scratch/refs/heads/main/part_1/01_main-code/the-verdict.txt\")\n",
" file_path = \"the-verdict.txt\"\n",
" url = (\"https://raw.githubusercontent.com/Sangwan70/Building-an-LLM-From-Scratch/refs/heads/main/part_1/01_main-code/wizard_of_oz.txt\")\n",
" file_path = \"wizard_of_oz.txt\"\n",
" urllib.request.urlretrieve(url, file_path)"
]
},
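As a self-contained sketch of this download step, assuming the existence check is meant to target the same wizard_of_oz.txt file that is written:

    import os
    import urllib.request

    file_path = "wizard_of_oz.txt"
    url = ("https://raw.githubusercontent.com/Sangwan70/Building-an-LLM-From-Scratch/"
           "refs/heads/main/part_1/01_main-code/wizard_of_oz.txt")

    # Download the e-book only if it is not already present locally.
    if not os.path.exists(file_path):
        urllib.request.urlretrieve(url, file_path)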
@@ -523,7 +532,7 @@
" - `[PAD]` (padding) if we train LLMs with a batch size greater than 1 (we may include multiple texts with different lengths; with the padding token we pad the shorter texts to the longest length so that all texts have an equal length)\n",
"- `[UNK]` to represent words that are not included in the vocabulary\n",
"\n",
"- Note that GPT-2 does not need any of these tokens mentioned above but only uses an `<|endoftext|>` token to reduce complexity\n",
"- GPT-2 does not need any of these tokens mentioned above but only uses an `<|endoftext|>` token to reduce complexity\n",
"- The `<|endoftext|>` is analogous to the `[EOS]` token mentioned above\n",
"- GPT also uses the `<|endoftext|>` for padding (since we typically use a mask when training on batched inputs, we would not attend padded tokens anyways, so it does not matter what these tokens are)\n",
"- GPT-2 does not use an `<UNK>` token for out-of-vocabulary words; instead, GPT-2 uses a byte-pair encoding (BPE) tokenizer, which breaks down words into subword units which we will discuss in a later section\n",
@@ -712,7 +721,7 @@
"- it allows the model to break down words that aren't in its predefined vocabulary into smaller subword units or even individual characters, enabling it to handle out-of-vocabulary words\n",
"- For instance, if GPT-2's vocabulary doesn't have the word \"unfamiliarword,\" it might tokenize it as [\"unfam\", \"iliar\", \"word\"] or some other subword breakdown, depending on its trained BPE merges\n",
"- The original BPE tokenizer can be found here: [https://github.com/openai/gpt-2/blob/master/src/encoder.py](https://github.com/openai/gpt-2/blob/master/src/encoder.py)\n",
"- In this chapter, we are using the BPE tokenizer from OpenAI's open-source [tiktoken](https://github.com/openai/tiktoken) library, which implements its core algorithms in Rust to improve computational performance\n",
"- In this lab, we are using the BPE tokenizer from open-source [tiktoken](https://github.com/openai/tiktoken) library, which implements its core algorithms in Rust to improve computational performance\n",
"- I created a notebook in the [./bytepair_encoder](../02_bonus_bytepair-encoder) that compares these two implementations side-by-side (tiktoken was about 5x faster on the sample text)"
]
},
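To see the subword behaviour described above, one can encode an out-of-vocabulary word and decode each resulting id individually; the exact split depends on GPT-2's trained BPE merges:

    import tiktoken

    tokenizer = tiktoken.get_encoding("gpt2")

    # An invented word that is unlikely to appear in GPT-2's 50,257-entry vocabulary.
    ids = tokenizer.encode("unfamiliarword")

    # Decoding each id on its own reveals the subword pieces.
    print(ids)
    print([tokenizer.decode([i]) for i in ids])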
@@ -825,7 +834,7 @@
"metadata": {},
"outputs": [],
"source": [
"with open(\"the-verdict.txt\", \"r\", encoding=\"utf-8\") as f:\n",
"with open(\"wizard_of_oz.txt\", \"r\", encoding=\"utf-8\") as f:\n",
" raw_text = f.read()\n",
"\n",
"enc_text = tokenizer.encode(raw_text)\n",