The loss curve

Chapter 11 · 16 min

Prepare a dataset

Move off toy data — load Shakespeare, tokenize it with a BPE tokenizer, build a train/val split, and save binary token files ready for training.

Up to here, your local project has been small enough to inspect by hand: toy text, list-based math, tiny matrices, no heavy dependencies. That was useful. It kept every part visible.

Now the project changes scale. We keep the same folder, but replace toy strings with a real corpus on disk. Even a small language model wants direct access to files, repeatable preprocessing, and a train/validation split you can trust.

This chapter does the boring-but-load-bearing part: get a corpus, tokenize it, save it as a single contiguous binary file your training loop can read fast.

Heads up: the OS selector at the top of the page matters. Commands switch between macOS/Linux (POSIX shells) and Windows (PowerShell) depending on what you picked. Pick the one that matches your machine.

What we're aiming for

The deliverable of this chapter is train.bin and val.bin on your disk: two binary files containing the tokenized corpus as a stream of 16-bit unsigned integers, split into a training set and a validation set. That's the format nanoGPT and most teaching-grade transformer trainers expect, and reading a .bin file back (even a multi-gigabyte one) is far faster than re-tokenizing the raw text at the start of every epoch.

Two browser cells first to ground the concept, then we write the preprocessing script in your existing my-llm/ folder.

1. Get a feel for what a tokenized dataset looks like

The browser cell below uses a fragment of a TinyStories-style children's story and a deliberately simple "tokenizer" (lowercase, strip punctuation, split on whitespace) so you can see the shape of the result. The real preprocessing on your machine will use a BPE tokenizer like the one from chapter 3, but the output is the same shape: a stream of integer token IDs.

Code · JavaScript

Notice the compression ratio: about 0.2 tokens per character with this naive tokenizer. A real BPE tokenizer typically lands at ~0.25-0.3 tokens per character for English text: far fewer tokens than character-level, slightly more than word-level, which is the price BPE pays for handling the rare-word tail without a huge vocabulary.

Whatever the corpus, the final .bin size is simply 2 × num_tokens bytes, because each token id is stored as a uint16 (65,536 possible values is plenty of vocabulary at our scale). For TinyShakespeare's ~302,000 tokens that works out to roughly 600 KB. Tiny by deep-learning standards.
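
If you want to poke at the same numbers outside the browser, here is a rough Python equivalent of what the cell computes. The story fragment is made up, and the tokenizer is the same deliberately naive one (lowercase, strip punctuation, split on whitespace), not BPE:

import re

# A made-up TinyStories-style fragment, just to have something to tokenize.
text = ("Once upon a time there was a little robot named Beep. "
        "Beep loved to count the stars and hum quiet songs at night.")

# Naive "tokenizer": lowercase, drop punctuation, split on whitespace.
words = re.sub(r"[^a-z\s]", "", text.lower()).split()

# Assign each unique word an integer id, then encode the whole fragment.
vocab = {w: i for i, w in enumerate(sorted(set(words)))}
ids = [vocab[w] for w in words]

print(f"{len(text)} characters -> {len(ids)} tokens "
      f"({len(ids) / len(text):.2f} tokens/char)")
print(f"vocab size: {len(vocab)}")
# Stored as uint16, this stream would occupy 2 * len(ids) bytes on disk.
print(f"estimated .bin size: {2 * len(ids)} bytes")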

2. From a token stream to training pairs

The training loop in chapter 13 will repeatedly:

  1. Pick a random starting position in train.bin.
  2. Read the next block_size tokens as the input context.
  3. Use the very next token as the target the model should predict.

That's the "sliding window of (context, target) pairs" we've been alluding to since chapter 1. Write the function that generates them.

Code · JavaScript

Each pair is one training example. In practice your training loop doesn't pre-compute all pairs — it samples random starting positions on the fly. But the conceptual model is the same: every position in the corpus is a training example, with the previous block_size tokens as the input and the current token as the target.
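
Here is the same idea as a small Python sketch you can run locally. The token stream is a stand-in array and block_size is an arbitrary illustration value, since train.bin doesn't exist yet at this point:

import numpy as np

block_size = 8                               # context length; illustration value only
ids = np.arange(40, dtype=np.uint16)         # stand-in for the token stream in train.bin

# Conceptual view: every position in the corpus is one (context, target) example.
pairs = [(ids[i:i + block_size], ids[i + block_size])
         for i in range(len(ids) - block_size)]
print(f"{len(pairs)} (context, target) pairs from {len(ids)} tokens")

# What the training loop actually does: sample a random start position on the fly.
rng = np.random.default_rng(0)
start = int(rng.integers(0, len(ids) - block_size))
context = ids[start:start + block_size]      # block_size tokens of input
target = ids[start + block_size]             # the very next token to predict
print("context:", context.tolist(), "-> target:", int(target))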

3. Install the data libraries

Activate the virtual environment you created in chapter 1:

cd my-llm && source .venv/bin/activate
cd my-llm; .\.venv\Scripts\Activate.ps1

Then install the small handful of libraries this chapter needs:

  • numpy for the binary serialization
  • tiktoken for GPT-style BPE tokenization (we'll use the GPT-2 tokenizer to skip having to train our own)
  • requests for downloading the corpus

pip install numpy tiktoken requests

If everything installed cleanly, python -c "import numpy, tiktoken, requests; print('ok')" should print ok and nothing else.

4. Download a small corpus into data/

The classic teaching-scale dataset is TinyShakespeare, 1.1 MB of Shakespeare's plays, small enough to fit in memory and to train a small model on in a few minutes on CPU. Karpathy ships it as a single text file.

curl -L -o data/input.txt https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Invoke-WebRequest -Uri https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt -OutFile data\input.txt
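
If you'd rather stay in Python (requests was installed for exactly this), a download sketch looks roughly like the following; it does the same thing as the one-liners above and also creates data/ if it doesn't exist yet:

from pathlib import Path

import requests

url = ("https://raw.githubusercontent.com/karpathy/char-rnn/"
       "master/data/tinyshakespeare/input.txt")

data_dir = Path("data")
data_dir.mkdir(exist_ok=True)   # earlier chapters may already have created data/

resp = requests.get(url, timeout=30)
resp.raise_for_status()         # fail loudly on a bad download
(data_dir / "input.txt").write_text(resp.text, encoding="utf-8")

print(f"wrote {data_dir / 'input.txt'} ({len(resp.text):,} characters)")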

You should now have data/input.txt (~1.1 MB). Inspect it:

head -n 20 data/input.txt
Get-Content data\input.txt -TotalCount 20

That's the data: ~40,000 lines of Shakespeare, just over 1.1 million characters.

5. The preprocessing script

Save this as scripts/prepare.py:

"""prepare.py — tokenize input.txt and save train.bin / val.bin."""
import numpy as np
import tiktoken
 
from pathlib import Path
 
# [1]
data_dir = Path("data")
 
with open(data_dir / "input.txt", "r", encoding="utf-8") as f:
    # [2]
    text = f.read()
 
# GPT-2 tokenizer: 50_257 entries, well-tested, no training needed.
# [3]
enc = tiktoken.get_encoding("gpt2")
ids = enc.encode_ordinary(text)
 
print(f"corpus: {len(text):,} characters → {len(ids):,} tokens")
print(f"compression: {len(ids) / len(text):.3f} tokens/char")
print(f"vocab size: {enc.n_vocab}")
 
# 90/10 train/val split.
# [4]
split = int(0.9 * len(ids))
# [5]
train_ids = np.array(ids[:split], dtype=np.uint16)
val_ids = np.array(ids[split:], dtype=np.uint16)
 
# [6]
train_ids.tofile(data_dir / "train.bin")
val_ids.tofile(data_dir / "val.bin")
 
print(f"wrote data/train.bin ({train_ids.nbytes:,} bytes) and data/val.bin ({val_ids.nbytes:,} bytes)")

Read the preprocessing script as a pipeline:

  • [1] Path("data") makes the script independent of your operating system's path separator.
  • [2] f.read() loads the raw corpus exactly once.
  • [3] enc.encode_ordinary(text) turns text into GPT-2 token ids. This is the production-strength version of the toy tokenizer from chapter 3.
  • [4] split holds back 10% of the ids for validation. The model must not train on those; the held-out set exists only to monitor training progress.
  • [5] np.uint16 stores each token id in two bytes. GPT-2's 50,257-entry vocab fits comfortably under the 65,536 limit.
  • [6] tofile writes raw binary ids. This is less readable than JSON, but much faster for the training loop.
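
One caveat on [5]: uint16 can only represent ids 0 through 65,535, so a tokenizer with a larger vocabulary would not fit. If you ever swap tokenizers, an optional one-line guard placed just before the np.array(...) lines (not part of the listing above) makes that failure explicit:

# Optional guard: every token id must fit in uint16 (0..65,535),
# otherwise the .bin files would be silently wrong.
assert enc.n_vocab <= 65536, f"vocab size {enc.n_vocab} does not fit in uint16"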

Then run it:

python -m scripts.prepare

You should see something like:

corpus: 1,115,394 characters → 301,966 tokens
compression: 0.271 tokens/char
vocab size: 50257
wrote data/train.bin (543,538 bytes) and data/val.bin (60,394 bytes)

That's it. data/train.bin and data/val.bin are now sitting on your disk, ready for chapter 13's training loop.
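
If you want a quick sanity check before the full verification below, a directory listing shows both files and their sizes (POSIX first, then PowerShell):

ls -lh data/
Get-ChildItem data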

6. Verify what's in train.bin

Save this short script as scripts/verify_data.py:

"""verify_data.py — confirm train.bin / val.bin are on the rails."""
import numpy as np
import tiktoken
 
 
# [1]
ids = np.fromfile("data/train.bin", dtype=np.uint16)
val_ids = np.fromfile("data/val.bin", dtype=np.uint16)
enc = tiktoken.get_encoding("gpt2")
 
# [2]
total = ids.size + val_ids.size
assert 280_000 < total < 330_000, f"expected ~302k total tokens, got {total:,}"
val_ratio = val_ids.size / total
assert 0.05 < val_ratio < 0.15, f"expected ~10% val split, got {val_ratio:.1%}"
 
# [3]
decoded = enc.decode(ids[:50].tolist())
assert "First" in decoded or "Citizen" in decoded, (
    f"decoded prefix does not look like TinyShakespeare: {decoded!r}"
)
 
# [4]
print(f"✓ {ids.size:,} train tokens, {val_ids.size:,} val tokens ({val_ratio:.1%} val)")
print(f"✓ first 20 ids: {ids[:20].tolist()}")
print(f"✓ decoded prefix: {enc.decode(ids[:20].tolist())!r}")

This verifier turns the round trip into pass/fail signals:

  • [1] np.fromfile reads raw integers back from disk; get_encoding("gpt2") recreates the tokenizer used during preprocessing.
  • [2] asserts the total token count and the train/val split match what prepare.py produced. If you accidentally re-ran with a different corpus, you'll see it here.
  • [3] asserts the decoded prefix contains a string you would expect from the start of TinyShakespeare. Catches the case where the tokenizer and the corpus drifted out of sync.
  • [4] the prints land only when all three assertions pass. If you see a ✓ on every line, your data path is sound.

Then run it:

python -m scripts.verify_data

You should see three ticks plus the first few words of Shakespeare:

✓ 271,769 train tokens, 30,197 val tokens (10.0% val)
✓ first 20 ids: [5962, 22307, 25, 198, 8421, ...]
✓ decoded prefix: 'First Citizen:\nBefore we proceed any further, ...'

If any assertion fires instead, the message tells you which step is off: wrong corpus, wrong tokenizer, or wrong file layout. Otherwise the round trip works: characters → BPE → integers → binary file → integers → BPE → characters.

Recap

  • data/train.bin and data/val.bin are the format teaching-grade trainers expect: a stream of uint16 token IDs on disk, split into train and validation sets.
  • The tokenizer is BPE, the same family as chapter 3. We use the GPT-2 tokenizer here so we don't have to train our own from scratch.
  • The training loop samples random starting positions in the file and reads block_size tokens at a time, with the next token as the target.
  • The data takes a single Python script to prepare. Most of the work in a real pipeline is finding a corpus and cleaning it; the tokenization and serialization are short.
  • Your local project now has real data instead of toy strings.

Going further

  • Karpathy's nanoGPT preprocessing scripts — same structure as ours, with more dataset choices (Shakespeare, OpenWebText, etc.).
  • tiktoken docs — the BPE tokenizer we used. Same one that GPT-2/3/4 use.
  • TinyStories — a much larger corpus of simple synthetic stories designed for small-model training research. Worth using if you want to scale beyond Shakespeare without leaving CPU.

Next up: the minimum code — the model itself, in fewer than 150 lines of PyTorch, with margin annotations pointing back at every chapter we've done so far.