Chapter 11 · 16 min
Prepare a dataset
Move off toy data — load Shakespeare, tokenize with your BPE, build train/val splits, save tensors ready for training.
Up to here, your local project has been small enough to inspect by hand: toy text, list-based math, tiny matrices, no heavy dependencies. That was useful. It kept every part visible.
Now the project changes scale. We keep the same folder, but replace toy strings with a real corpus on disk. Even a small language model wants direct access to files, repeatable preprocessing, and a train/validation split you can trust.
This chapter does the boring-but-load-bearing part: get a corpus, tokenize it, save it as a single contiguous binary file your training loop can read fast.
Heads up: the OS selector at the top of the page matters. Commands switch between macOS/Linux (POSIX shells) and Windows (PowerShell) depending on what you picked. Pick the one that matches your machine.
What we're aiming for
The deliverable of this chapter is train.bin and val.bin on your disk: two binary files containing the tokenized corpus as a stream of 16-bit unsigned integers, split into training and validation sets. That's the format nanoGPT and most teaching-grade transformer trainers expect, and reading token ids straight from a .bin file is far faster than re-tokenizing the text every epoch.
Two browser cells first to ground the concept, then we write the preprocessing script in your existing my-llm/ folder.
1. Get a feel for what a tokenized dataset looks like
The browser cell below uses a fragment of a TinyStories-style children's story and a deliberately simple "tokenizer" (lowercase, strip punctuation, split on whitespace) so you can see the shape of the result. The real preprocessing on your machine will use a BPE tokenizer like the one from chapter 3, but the output is the same shape: a stream of integer token IDs.
Code · JavaScript
Notice the compression ratio: about 0.2 tokens per character with this naive word-level tokenizer. A real BPE tokenizer typically lands at ~0.25-0.3 tokens/char for English text: more tokens than word-level splitting, far fewer than character-level's 1 token/char, and no unbounded vocabulary to cover the rare-word tail.
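For reference, the same naive scheme fits in a few lines of Python. This is a sketch, not the browser cell itself; the story fragment and the word-level vocab are purely illustrative:

```python
import re

def naive_tokenize(text):
    # Lowercase, strip punctuation, split on whitespace,
    # then assign each unique word an integer id in order of appearance.
    words = re.sub(r"[^\w\s]", "", text.lower()).split()
    vocab = {}
    ids = []
    for w in words:
        ids.append(vocab.setdefault(w, len(vocab)))
    return ids, vocab

story = "Once upon a time, a little fox found a shiny red ball. The fox loved the ball."
ids, vocab = naive_tokenize(story)
print(f"{len(story)} chars -> {len(ids)} tokens "
      f"({len(ids) / len(story):.2f} tokens/char)")
```

The output shape is the point: any tokenizer, naive or BPE, reduces text to a flat list of integer ids plus a vocabulary mapping.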
Whatever the corpus, the final train.bin size is simply 2 × num_tokens bytes: we store each id as a uint16, which holds values up to 65,535, comfortably above GPT-2's 50,257-entry vocabulary. For TinyStories with ~5M tokens, that's ~10 MB. Tiny by deep-learning standards.
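The arithmetic is worth checking once (a quick numpy sanity check; the token count used here is the one TinyShakespeare will produce later in this chapter):

```python
import numpy as np

# uint16 holds ids 0..65535, so GPT-2's 50,257-entry vocab fits.
assert np.iinfo(np.uint16).max == 65535 and 50_257 <= 65_536

num_tokens = 301_966  # what TinyShakespeare tokenizes to with GPT-2's BPE
print(f"{2 * num_tokens:,} bytes on disk")  # 2 bytes per token -> ~0.6 MB
```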
2. From a token stream to training pairs
The training loop in chapter 13 will repeatedly:
- Pick a random starting position in train.bin.
- Read the next block_size tokens as the input context.
- Use the very next token as the target the model should predict.
That's the "sliding window of (context, target) pairs" we've been alluding to since chapter 1. Write the function that generates them.
Code · JavaScript
Each pair is one training example. In practice your training loop doesn't pre-compute all pairs — it samples random starting positions on the fly. But the conceptual model is the same: every position in the corpus is a training example, with the previous block_size tokens as the input and the current token as the target.
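The browser cell above is in JavaScript; the same sliding window fits in a few lines of Python (a sketch with arbitrary toy ids):

```python
def context_target_pairs(ids, block_size):
    # Every position past the first block_size tokens is one example:
    # the preceding block_size tokens in, the current token out.
    for i in range(len(ids) - block_size):
        yield ids[i : i + block_size], ids[i + block_size]

ids = [5, 1, 8, 3, 9, 2]
for context, target in context_target_pairs(ids, block_size=3):
    print(context, "->", target)
# [5, 1, 8] -> 3
# [1, 8, 3] -> 9
# [8, 3, 9] -> 2
```

Six tokens and a block size of 3 yield three overlapping examples; a million-token corpus yields roughly a million.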
3. Install the data libraries
Activate the virtual environment you created in chapter 1:
macOS/Linux: cd my-llm && source .venv/bin/activate
Windows (PowerShell): cd my-llm; .\.venv\Scripts\Activate.ps1

Then install the small handful of libraries this chapter needs:
- numpy for the binary serialization
- tiktoken for GPT-style BPE tokenization (we'll use the GPT-2 tokenizer to skip having to train our own)
- requests for downloading the corpus
pip install numpy tiktoken requests

If everything installed cleanly, python -c "import numpy, tiktoken, requests; print('ok')" should print ok and nothing else.
4. Download a small corpus into data/
The classic teaching-scale dataset is TinyShakespeare (1.1 MB, a concatenated sample of Shakespeare's plays), which fits in memory and trains in a few minutes on CPU. Karpathy ships it as a single text file.
macOS/Linux: curl -L -o data/input.txt https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Windows (PowerShell): Invoke-WebRequest -Uri https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt -OutFile data\input.txt

If the data/ directory doesn't exist yet, create it first; neither command will create it for you. You should now have data/input.txt (~1.1 MB). Inspect it:
macOS/Linux: head -n 20 data/input.txt
Windows (PowerShell): Get-Content data\input.txt -TotalCount 20

That's the data: ~40,000 lines of Shakespeare, ~1.1 million characters.
5. The preprocessing script
Save this as scripts/prepare.py:
"""prepare.py — tokenize input.txt and save train.bin / val.bin."""
import numpy as np
import tiktoken
from pathlib import Path
# [1]
data_dir = Path("data")
with open(data_dir / "input.txt", "r", encoding="utf-8") as f:
# [2]
text = f.read()
# GPT-2 tokenizer: 50_257 entries, well-tested, no training needed.
# [3]
enc = tiktoken.get_encoding("gpt2")
ids = enc.encode_ordinary(text)
print(f"corpus: {len(text):,} characters → {len(ids):,} tokens")
print(f"compression: {len(ids) / len(text):.3f} tokens/char")
print(f"vocab size: {enc.n_vocab}")
# 90/10 train/val split.
# [4]
split = int(0.9 * len(ids))
# [5]
train_ids = np.array(ids[:split], dtype=np.uint16)
val_ids = np.array(ids[split:], dtype=np.uint16)
# [6]
train_ids.tofile(data_dir / "train.bin")
val_ids.tofile(data_dir / "val.bin")
print(f"wrote data/train.bin ({train_ids.nbytes:,} bytes) and data/val.bin ({val_ids.nbytes:,} bytes)")

Read the preprocessing script as a pipeline:
- [1] Path("data") makes the script independent of your operating system's path separator.
- [2] f.read() loads the raw corpus exactly once.
- [3] enc.encode_ordinary(text) turns text into GPT-2 token ids. This is the production-strength version of the toy tokenizer from chapter 3.
- [4] split holds back 10% of the ids for validation. The model must not train on those; the validation set exists only to monitor training progress.
- [5] np.uint16 stores each token id in two bytes. GPT-2's vocab fits because it has only 50,257 entries.
- [6] tofile writes raw binary ids. This is less readable than JSON, but much faster for the training loop.
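One detail of [6] worth internalizing: tofile writes nothing but the raw bytes. No header, dtype, or shape survives, so whoever reads the file back must re-supply the dtype. A minimal round trip, using the five ids you'll see at the start of TinyShakespeare later in the chapter:

```python
import numpy as np
import os, tempfile

ids = np.array([5962, 22307, 25, 198, 8421], dtype=np.uint16)
path = os.path.join(tempfile.mkdtemp(), "train.bin")
ids.tofile(path)  # raw bytes only: no header, no dtype, no shape
back = np.fromfile(path, dtype=np.uint16)  # dtype must be re-supplied
assert np.array_equal(ids, back)
print(os.path.getsize(path))  # 10 bytes: 5 tokens x 2 bytes each
```

Read the same file with the wrong dtype and you get silently scrambled ids, which is exactly the failure mode the verifier in section 6 is designed to catch.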
Then run it:
python -m scripts.prepare

You should see something like:
corpus: 1,115,394 characters → 301,966 tokens
compression: 0.271 tokens/char
vocab size: 50257
wrote data/train.bin (543,538 bytes) and data/val.bin (60,394 bytes)
That's it. data/train.bin and data/val.bin are now sitting on your disk, ready for chapter 13's training loop.
6. Verify what's in train.bin
Save this short script as scripts/verify_data.py:
"""verify_data.py — confirm train.bin / val.bin are on the rails."""
import numpy as np
import tiktoken
# [1]
ids = np.fromfile("data/train.bin", dtype=np.uint16)
val_ids = np.fromfile("data/val.bin", dtype=np.uint16)
enc = tiktoken.get_encoding("gpt2")
# [2]
total = ids.size + val_ids.size
assert 280_000 < total < 330_000, f"expected ~302k total tokens, got {total:,}"
val_ratio = val_ids.size / total
assert 0.05 < val_ratio < 0.15, f"expected ~10% val split, got {val_ratio:.1%}"
# [3]
decoded = enc.decode(ids[:50].tolist())
assert "First" in decoded or "Citizen" in decoded, (
f"decoded prefix does not look like TinyShakespeare: {decoded!r}"
)
# [4]
print(f"✓ {ids.size:,} train tokens, {val_ids.size:,} val tokens ({val_ratio:.1%} val)")
print(f"✓ first 20 ids: {ids[:20].tolist()}")
print(f"✓ decoded prefix: {enc.decode(ids[:20].tolist())!r}")

This verifier turns the round trip into pass/fail signals:
- [1] np.fromfile reads raw integers back from disk; get_encoding("gpt2") recreates the tokenizer used during preprocessing.
- [2] asserts the total token count and the train/val split match what prepare.py produced. If you accidentally re-ran with a different corpus, you'll see it here.
- [3] asserts the decoded prefix contains a string you would expect from the start of TinyShakespeare. Catches the case where the tokenizer and the corpus drifted out of sync.
- [4] the prints land only when all three assertions pass. If you see ✓ on every line, your data path is sound.
Then run it:
python -m scripts.verify_data

You should see three ticks plus the first few words of Shakespeare:
✓ 271,769 train tokens, 30,197 val tokens (10.0% val)
✓ first 20 ids: [5962, 22307, 25, 198, 8421, ...]
✓ decoded prefix: 'First Citizen:\nBefore we proceed any further, ...'
If any assertion fires instead, the message tells you which step is off: wrong corpus, wrong tokenizer, or wrong file layout. The round trip works: characters → token ids → binary file → token ids → characters.
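As a preview of how a training loop can consume these files, here is one way the random sampling might look. This is a sketch, not chapter 13's actual code; the function name and signature are illustrative:

```python
import numpy as np

def get_batch(path, block_size, batch_size, rng):
    # memmap keeps the file on disk; only the bytes each batch
    # touches are ever read into memory.
    data = np.memmap(path, dtype=np.uint16, mode="r")
    # Random starting positions; each yields block_size context tokens
    # plus the single next token as the target.
    starts = rng.integers(0, len(data) - block_size, size=batch_size)
    x = np.stack([data[s : s + block_size] for s in starts])
    y = np.array([data[s + block_size] for s in starts])
    return x, y
```

Because np.memmap never loads the whole file, the same code works unchanged when the corpus is gigabytes rather than kilobytes.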
Recap
- data/train.bin and data/val.bin are the format teaching-grade trainers expect: a stream of uint16 token ids on disk.
- The tokenizer is BPE, same family as chapter 3. We use the GPT-2 tokenizer here so we don't have to train our own.
- The training loop samples random starting positions in the file and reads block_size tokens at a time, with the next token as the target.
- The data takes a single Python script to prepare. Most of the work in a real pipeline is finding and cleaning a corpus; the tokenization and serialization are short.
- Your local project now has real data, split into train and validation sets, instead of toy strings.
Going further
- Karpathy's nanoGPT preprocessing scripts — same structure as ours, with more dataset choices (Shakespeare, OpenWebText, etc.).
- tiktoken docs — the BPE tokenizer we used. Same one that GPT-2/3/4 use.
- TinyStories — a slightly bigger corpus designed for small-model training research. Worth using if you want to scale beyond Shakespeare without leaving CPU.
Next up: the minimum code — the model itself, in fewer than 150 lines of PyTorch, with margin annotations pointing back at every chapter we've done so far.