
PyTorch LLM tutorial — from blank file to working GPT

A complete PyTorch LLM tutorial — model code in ~150 lines, training loop, optimizer, sampling. From a blank file to a working GPT.

This guide is a pointer into the PyTorch part of The Loss Curve — the chapters where the project moves from browser JavaScript to a working local LLM in PyTorch. If you arrived searching for "PyTorch LLM tutorial," start here, then follow the chapters in order.

1. The shape of the code

By the end of the PyTorch part, your my-llm/ project looks like this:

my-llm/
├── llm/
│   ├── tokenizer.py    # BPE you trained in chapter 3
│   ├── model.py        # the GPT model (~150 lines)
│   └── train.py        # the training loop
├── data/
│   ├── train.bin       # tokenized training data
│   └── val.bin
└── checkpoints/
    └── ckpt.pt         # trained weights

Plain files. Plain PyTorch. Nothing exotic.

2. Preparing the dataset

Chapter 11 — Prepare a real training dataset loads a corpus (Shakespeare by default), tokenizes it with the BPE tokenizer you trained earlier, builds a train/val split, and saves the token IDs as flat binary files on disk.
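
The mechanics fit in a few lines. A minimal sketch, assuming a BPE tokenizer with load and encode methods and a corpus at data/corpus.txt (both are placeholders, not the chapter's exact interface):

import numpy as np

from llm.tokenizer import Tokenizer   # hypothetical import; chapter 3's BPE tokenizer

text = open("data/corpus.txt", encoding="utf-8").read()
tokens = np.array(Tokenizer.load("tokenizer.json").encode(text),
                  dtype=np.uint16)   # uint16 covers any vocab under 65,536

split = int(0.9 * len(tokens))       # 90/10 train/val split
tokens[:split].tofile("data/train.bin")
tokens[split:].tofile("data/val.bin")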

The training loop reads train.bin directly via numpy.memmap — efficient, no copies, scales to corpora larger than RAM.
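
A sketch of that read path, with block_size and batch_size as assumed hyperparameters:

import numpy as np
import torch

def get_batch(split, block_size=256, batch_size=32):
    # memory-map the file: the OS pages tokens in on demand, nothing is copied up front
    data = np.memmap(f"data/{split}.bin", dtype=np.uint16, mode="r")
    ix = torch.randint(len(data) - block_size, (batch_size,))
    # x is a window of tokens, y is the same window shifted one position right
    x = torch.stack([torch.from_numpy(data[i:i + block_size].astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy(data[i + 1:i + 1 + block_size].astype(np.int64)) for i in ix])
    return x, y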

3. The model

Chapter 12 — A GPT model in ~150 lines of PyTorch writes the whole model:

import torch.nn as nn

class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.token_emb = nn.Embedding(config.vocab_size, config.n_embd)   # token embeddings
        self.pos_emb = nn.Embedding(config.block_size, config.n_embd)     # learned positions
        self.blocks = nn.ModuleList([Block(config) for _ in range(config.n_layer)])
        self.ln_f = nn.LayerNorm(config.n_embd)                           # final LayerNorm
        self.head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        self.head.weight = self.token_emb.weight   # weight tying

    def forward(self, idx, targets=None):
        ...
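
The config is just a bag of hyperparameters. A sketch of what it might hold, with assumed values at roughly the scale of the course's ~14M-parameter reference model:

from dataclasses import dataclass

@dataclass
class GPTConfig:
    vocab_size: int = 8192   # assumed BPE vocab size
    block_size: int = 256    # context length
    n_layer: int = 6
    n_head: int = 6
    n_embd: int = 384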

The Block is multi-head attention + feed-forward with residuals and LayerNorm. Each piece was built by hand in chapters 8–10; chapter 12 is the cleanup pass.
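
For orientation, here is a sketch of a block in that shape, using the standard pre-LayerNorm layout. The course builds the attention math by hand; this sketch leans on F.scaled_dot_product_attention (PyTorch 2.0+) to stay short, so treat it as the outline rather than the chapter's code:

import torch.nn as nn
import torch.nn.functional as F

class Block(nn.Module):
    """Pre-LN Transformer block: attention and feed-forward, each wrapped in a residual."""
    def __init__(self, config):
        super().__init__()
        self.ln1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln2 = nn.LayerNorm(config.n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(config.n_embd, 4 * config.n_embd),
            nn.GELU(),
            nn.Linear(4 * config.n_embd, config.n_embd),
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))   # residual around attention
        x = x + self.mlp(self.ln2(x))    # residual around feed-forward
        return x

class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.n_head = config.n_head
        self.qkv = nn.Linear(config.n_embd, 3 * config.n_embd)   # fused q, k, v projection
        self.proj = nn.Linear(config.n_embd, config.n_embd)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # split heads: (B, T, C) -> (B, n_head, T, head_dim)
        q, k, v = (t.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) for t in (q, k, v))
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # causal mask built in
        y = y.transpose(1, 2).reshape(B, T, C)                       # merge heads back
        return self.proj(y)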

4. The training loop

Chapter 13 — The training loop writes the loop:

for step in range(max_steps):
    x, y = get_batch("train")                 # random batch of (input, shifted-target) windows
    logits, loss = model(x, y)
    optimizer.zero_grad(set_to_none=True)     # set_to_none skips a wasted memset
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # cap the gradient norm at 1.0
    optimizer.step()

    if step % eval_interval == 0:
        val_loss = estimate_loss("val")
        save_checkpoint(model, optimizer, step, val_loss)

With AdamW, gradient clipping, periodic validation, and clean checkpointing. The chapter visualizes the loss curve as it runs.
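
The names get_batch, estimate_loss, and save_checkpoint come from the loop above; the chapter defines them. A minimal sketch of the optimizer setup and the two helpers, with hyperparameter values assumed:

import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)  # assumed values

@torch.no_grad()
def estimate_loss(split, eval_iters=50):
    model.eval()                      # switch off dropout etc. while evaluating
    losses = torch.zeros(eval_iters)
    for i in range(eval_iters):
        x, y = get_batch(split)
        _, loss = model(x, y)
        losses[i] = loss.item()
    model.train()
    return losses.mean().item()

def save_checkpoint(model, optimizer, step, val_loss):
    torch.save(
        {"model": model.state_dict(),
         "optimizer": optimizer.state_dict(),
         "step": step,
         "val_loss": val_loss},
        "checkpoints/ckpt.pt",
    )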

5. Generation

Chapter 14 — Sampling: temperature, top-k, top-p writes the sampler. A trained model is just a function from token sequences to next-token distributions; how you sample from those distributions is the difference between a useful generator and a boring one.
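
A sketch of the core loop, with temperature and top-k shown (top-p follows the same pattern, sorting the probabilities and truncating at cumulative mass p); block_size is the model's context length:

import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, idx, max_new_tokens, temperature=1.0, top_k=None, block_size=256):
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]              # crop to the context window
        logits, _ = model(idx_cond)
        logits = logits[:, -1, :] / temperature      # last position only; temperature rescales
        if top_k is not None:
            v, _ = torch.topk(logits, top_k)
            logits[logits < v[:, [-1]]] = -float("inf")  # drop everything below the k-th logit
        probs = F.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_token], dim=1)
    return idx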

6. Loading real weights

Chapter 15 — Load GPT-2 weights into your model demonstrates that your architecture is identical to GPT-2 small: the same GPT class can load OpenAI's 124M weights with a name-mapping table.

This is where the course pays off most concretely. You wrote the architecture. You trained it on your data. Now the same code runs a real frontier-era model.
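
The name-mapping table is the chapter's real content; the surrounding mechanics are roughly this. A sketch that pulls the checkpoint via Hugging Face's transformers (an assumption — the chapter may fetch the weights differently) and shows a fragment of the map:

import torch
from transformers import GPT2LMHeadModel   # pip install transformers

hf_sd = GPT2LMHeadModel.from_pretrained("gpt2").state_dict()  # the 124M checkpoint

# Fragment of the map: your parameter names -> Hugging Face's.
name_map = {
    "token_emb.weight": "transformer.wte.weight",
    "pos_emb.weight": "transformer.wpe.weight",
    "ln_f.weight": "transformer.ln_f.weight",
    "ln_f.bias": "transformer.ln_f.bias",
    # per-block entries omitted here
}

with torch.no_grad():
    sd = model.state_dict()
    for ours, theirs in name_map.items():
        sd[ours].copy_(hf_sd[theirs])
# One wrinkle: GPT-2's checkpoint stores attention and MLP projections as
# Conv1D, so those 2-D weights need a transpose (.t()) before copying.

Because head.weight is tied to token_emb.weight, copying the embedding also fills the output head.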

7. Fine-tuning

From here, the rest of the course is about making a base model useful.

8. Other PyTorch LLM tutorials

Worth knowing about:

  • Karpathy's nanoGPT — denser, faster, optimized for training larger models on modest hardware.
  • minGPT — Karpathy's earlier and even more readable variant.
  • The Annotated Transformer — the classic walkthrough, encoder-decoder rather than decoder-only.

The Loss Curve is more spread out than any of these — it takes its time and explains each piece. That's the trade-off.

Frequently asked questions

How much PyTorch do I need to know?

Comfort with `nn.Module`, `forward`, `optim`, and a basic training loop. If you can write a working MLP that trains on MNIST, you can follow this — every Transformer-specific concept is explained inline.

Why ~150 lines? Couldn't it be smaller?

It could, with more density. The course optimizes for readability — separate functions for attention, the block, the model, and an explicit forward pass you can trace by eye. Shorter implementations exist (Karpathy's nanoGPT, for example); this one is built to be read.

What PyTorch version?

Any recent PyTorch (2.0+). The course uses standard `torch.nn` modules — no cutting-edge APIs. CPU works; CUDA and MPS work; nothing more exotic.

Does the training run on CPU?

Yes. The reference run uses a small model (~14M parameters) on a small corpus, so a few thousand steps finish in 15–30 minutes on a modern laptop CPU. GPU/MPS is faster but not required.
