Train a small language model — locally
You can train a small language model on a normal laptop. This guide is a practical walk-through — dataset, model, training loop, evaluation — built on top of the chapters in The Loss Curve.
1. Decide on size
The two knobs that matter for "small":
- Model size. The course's reference is ~14M parameters: 6 Transformer layers, 6 heads, embedding dim 384, block size 256. Trains comfortably on CPU.
- Dataset size. A few megabytes of text gets you readable output. Tens of megabytes and the model improves noticeably; hundreds, and the tiny architecture caps out.
For perspective: GPT-2 small is 124M parameters trained on ~40 GB of text. You're aiming for something like 1/10th the parameters and 1/40000th the data. Output quality won't match, but training will fit on your laptop.
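Those numbers hang together: a back-of-the-envelope count (a sketch, not the course's exact accounting) recovers the ~14M figure from the reference config alone.

```python
# Rough parameter count for the reference config.
# Ignores biases, LayerNorms, and the small positional table (~0.1M).
n_layer, n_embd, vocab_size = 6, 384, 10_000

per_block = 12 * n_embd ** 2       # attention (4x n_embd^2) + MLP (8x) weights
blocks = n_layer * per_block       # ~10.6M
embeddings = vocab_size * n_embd   # ~3.8M, shared with the output head

print(f"~{(blocks + embeddings) / 1e6:.1f}M parameters")  # ~14.5M
```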
2. Prepare the data
Chapter 11 — Prepare a real training dataset loads a corpus, tokenizes it with the BPE tokenizer you trained in chapter 3, builds a train/val split, and saves the tokens as data/train.bin and data/val.bin.
This is the unsexy part — but skipping it makes everything later hurt. Tokenization choices (vocab size, what to do with newlines, byte-level vs char-level) lock in the rest of the training run.
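The shape of that step, as a minimal sketch. The tokenizer is assumed to expose an encode(str) -> list[int] method; the function and file names here are illustrative, not chapter 11's actual code.

```python
import os
import numpy as np

def save_splits(tokenizer, corpus_path="corpus.txt", out_dir="data"):
    """Tokenize a corpus and write train/val token files.

    Sketch only: assumes `tokenizer.encode` returns a list of token ids,
    as the chapter-3 BPE tokenizer plausibly does.
    """
    text = open(corpus_path, encoding="utf-8").read()
    ids = tokenizer.encode(text)

    # 90/10 train/val split; uint16 is enough for vocab_size <= 65,535.
    split = int(0.9 * len(ids))
    os.makedirs(out_dir, exist_ok=True)
    np.array(ids[:split], dtype=np.uint16).tofile(f"{out_dir}/train.bin")
    np.array(ids[split:], dtype=np.uint16).tofile(f"{out_dir}/val.bin")
```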
3. Write the model
Chapter 12 — A GPT model in ~150 lines of PyTorch writes a complete GPT-style decoder. Six Transformer blocks. AdamW. Cross-entropy loss. Weight tying between embeddings and the output head.
```python
config = GPTConfig(
    vocab_size=10000,
    block_size=256,
    n_layer=6,
    n_head=6,
    n_embd=384,
    dropout=0.0,
)
model = GPT(config)
```

That's the entire architecture spec.
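For a feel of what those ~150 lines contain, here is one decoder block, compressed. This is a sketch of the standard pre-norm structure, leaning on PyTorch's built-in scaled_dot_product_attention; the chapter's hand-written version will differ in detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Block(nn.Module):
    """One pre-norm decoder block: causal self-attention + MLP.
    A sketch of the structure, not the chapter's exact code."""

    def __init__(self, n_embd=384, n_head=6, dropout=0.0):
        super().__init__()
        self.n_head = n_head
        self.ln1 = nn.LayerNorm(n_embd)
        self.qkv = nn.Linear(n_embd, 3 * n_embd)   # fused q, k, v projection
        self.proj = nn.Linear(n_embd, n_embd)
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd), nn.GELU(),
            nn.Linear(4 * n_embd, n_embd), nn.Dropout(dropout),
        )

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(self.ln1(x)).split(C, dim=2)
        # (B, T, C) -> (B, n_head, T, head_dim)
        q, k, v = (t.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
                   for t in (q, k, v))
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).reshape(B, T, C)
        x = x + self.proj(y)                 # residual around attention
        return x + self.mlp(self.ln2(x))     # residual around MLP
```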
4. Train
Chapter 13 — The training loop ships the loop (sketched after the list):
- batched (input, target) sampling from train.bin
- AdamW with weight decay 0.1, betas (0.9, 0.95)
- learning rate warmup + cosine decay
- gradient clipping at 1.0
- validation loss every N steps
- checkpoint saved on best validation
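Condensed, the skeleton looks roughly like this. Assumptions to flag: `model(x, y)` is taken to return (logits, loss) as in section 3, and the learning rate, warmup length, and step count are illustrative, not the course's values. Only the sampling/schedule/clipping/checkpoint logic mirrors the list above.

```python
import math
import numpy as np
import torch

def get_batch(split, batch_size=32, block_size=256):
    # Random (input, target) windows from the tokenized .bin file.
    data = np.memmap(f"data/{split}.bin", dtype=np.uint16, mode="r")
    ix = torch.randint(len(data) - block_size - 1, (batch_size,))
    x = torch.stack([torch.from_numpy(data[i:i + block_size].astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy(data[i + 1:i + 1 + block_size].astype(np.int64)) for i in ix])
    return x, y

def lr_at(step, max_lr=6e-4, warmup=100, max_steps=5000):
    # Linear warmup, then cosine decay toward zero.
    if step < warmup:
        return max_lr * (step + 1) / warmup
    t = (step - warmup) / (max_steps - warmup)
    return 0.5 * max_lr * (1 + math.cos(math.pi * t))

optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4,
                              betas=(0.9, 0.95), weight_decay=0.1)
best_val = float("inf")

for step in range(5000):
    for g in optimizer.param_groups:
        g["lr"] = lr_at(step)
    x, y = get_batch("train")
    _, loss = model(x, y)                    # assumes (logits, loss) return
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip at 1.0
    optimizer.step()

    if step % 250 == 0:
        with torch.no_grad():
            vx, vy = get_batch("val")
            val = model(vx, vy)[1].item()
        if val < best_val:                   # checkpoint on best validation
            best_val = val
            torch.save(model.state_dict(), "ckpt.pt")
```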
The training visualization in chapter 13 plots train vs validation loss live. You can read it like a clinician — overfitting, underfitting, LR too high, LR too low, all have characteristic shapes.
5. Watch the loss curve
The training curve is the most important diagnostic you have. The Loss Curve takes its name from this fact.
- Both curves dropping in parallel: training is healthy, model has capacity left, keep going.
- Validation flattens while training keeps dropping: overfitting. Stop training, or get more data.
- Both flatten high: under-capacity model or learning rate too low.
- Loss spikes or diverges: gradient explosion. Drop the learning rate or strengthen clipping.
Chapter 13 walks through each case with concrete examples.
6. Generate
Once trained, chapter 14 — Sampling turns the model into a text generator with temperature, top-K, and top-P controls. The same checkpoint will produce wildly different outputs at different settings — worth experimenting with before deciding the model is "done."
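The heart of a sampling step is small enough to show inline. A sketch of temperature plus top-K (top-P omitted for brevity; the chapter's interface will differ):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sample_next(logits, temperature=0.8, top_k=50):
    """One sampling step from final-position logits, shape (vocab_size,).
    A sketch of temperature + top-K, not the chapter's exact code."""
    logits = logits / temperature               # <1.0 sharpens, >1.0 flattens
    if top_k is not None:
        v, _ = torch.topk(logits, top_k)
        logits[logits < v[-1]] = -float("inf")  # mask everything below k-th
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```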
7. Loading larger pre-trained models
If your tiny model isn't enough, chapter 15 shows that the same architecture can run GPT-2 124M with OpenAI's weights: nearly an order of magnitude bigger, yet it still fits on a laptop. Then chapter 17 makes it follow instructions and chapter 21 ships a specialized variant.
The path from a 14M model you trained to a 124M model you fine-tuned is one extra chapter and a 500 MB download.
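If you just want those weights on disk, the Hugging Face checkpoint is one convenient source; this is an assumption about tooling, not necessarily how chapter 15 fetches them.

```python
# One way to fetch GPT-2 124M weights (requires `pip install transformers`);
# chapter 15 presumably remaps these tensors into its own GPT class.
from transformers import GPT2LMHeadModel

hf_model = GPT2LMHeadModel.from_pretrained("gpt2")  # ~500 MB download
state_dict = hf_model.state_dict()
print(sum(p.numel() for p in hf_model.parameters()) / 1e6, "M params")  # ~124M
```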
8. Where to go next
- For a higher-level overview: Build an LLM from Scratch.
- For the full PyTorch tour: PyTorch LLM Tutorial.
- For instruction-tuning your trained model: chapter 17 + LoRA guide.
Frequently asked questions
Can I really train a language model on my laptop?
A small one, yes. The Loss Curve trains a tiny GPT-2-style model (~14M parameters) on a few megabytes of text in under an hour on CPU, much faster on a consumer GPU. It won't beat GPT-4 — but you'll understand every part of it.
What hardware do I need?
Nothing special. The reference run works on a 2020-era laptop CPU. A consumer GPU (RTX 3060 or better) or Apple Silicon (M1 and later, via PyTorch's MPS backend) drops training time from ~30 minutes to a few minutes. An NVIDIA card is not required.
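Pointing PyTorch at whatever accelerator you have is a few lines (a sketch; `model` is the one built in section 3):

```python
import torch

# Pick the fastest available backend, falling back to CPU.
device = ("cuda" if torch.cuda.is_available()
          else "mps" if torch.backends.mps.is_available()
          else "cpu")
model = model.to(device)  # `model` from section 3; batches must move too
```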
How much data should I use?
A few megabytes is enough to get readable output. The course's reference corpus is ~1 MB of Shakespeare. With more data the model gets noticeably better up to a point — but the bottleneck on a tiny model is capacity, not data.
What's a good loss to aim for?
It depends on your tokenizer and corpus. On the included Shakespeare-like corpus with a small BPE vocabulary, validation loss around 2.0 is typical and produces readable samples. If validation loss diverges from training loss, you're overfitting: stop training earlier or add more data.
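Loss numbers are easier to compare as perplexity: cross-entropy loss is in nats per token, so perplexity is just exp(loss).

```python
import math

val_loss = 2.0
# Perplexity ~= 7.4: per token, the model is about as uncertain as a
# uniform choice among ~7 candidates.
print(math.exp(val_loss))
```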
How long does training take?
Chapter 13's reference run is 15–30 minutes on a modern laptop CPU and 3–5 minutes on a consumer GPU. Smaller models train faster; training time grows roughly in proportion to model size and data.