
Build an LLM from scratch — a code-first walkthrough

A complete, code-first walkthrough for building a GPT-style LLM from scratch. Tokenizer, embeddings, Transformer, training, weights, fine-tuning.

The Loss Curve is a free, interactive course that walks you through building a GPT-style language model from scratch. Every chapter pairs a short explanation with code you can run — first in your browser, then locally in your own my-llm/ project. By the end you have a working model you understand because you wrote every part of it.

1. Why "from scratch" matters

Frameworks like Hugging Face Transformers are wonderful for shipping. They are bad teachers. When `model = AutoModelForCausalLM.from_pretrained("gpt2")` is one line of code, every concept underneath stays opaque.

Building from scratch is the opposite trade-off. It's slower, and the resulting model is much smaller than anything you'd ship. But the parts stop being magic. You can answer questions like "why are the attention scores divided by √d_k before the softmax?" with a memory of the moment you typed it.
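
To make that concrete, here is a minimal sketch of the computation the question refers to: scaled dot-product attention for a single head. The tensor names and shapes are illustrative, not taken from the course's code.

```python
import torch
import torch.nn.functional as F

def attention(q, k, v):
    """Scaled dot-product attention for one head."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # the division by sqrt(d_k)
    weights = F.softmax(scores, dim=-1)            # each row sums to 1
    return weights @ v

q = k = v = torch.randn(1, 8, 64)  # (batch, seq_len, d_k) -- illustrative
print(attention(q, k, v).shape)    # torch.Size([1, 8, 64])
```

Without that division the dot products grow with d_k and the softmax saturates, which is the answer the course walks you to.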

2. The six parts of the course

The course is split into six parts. Each one ends with a runnable artifact saved into your local my-llm/ project.

  • Part I — Start the project (chapters 1–4). Tokens, bigrams, BPE tokenizer, embeddings. The first piece of a real LLM is a tokenizer trained on your data.
  • Part II — Make it learn (chapters 5–7). A single neuron with one trainable weight; then an MLP; then optimizers (SGD, momentum, Adam). You stop counting and start descending gradients.
  • Part III — Build the Transformer (chapters 8–10). An attention head, multi-head attention, residuals, LayerNorm, feed-forward — assembled into the block GPT stacks (see the sketch after this list).
  • Part IV — Train and use the LLM (chapters 11–16). Real dataset, PyTorch model in 150 lines, training loop, sampling strategies, then GPT-2 124M weights loaded into the same architecture you wrote.
  • Part V — Make it useful (chapters 17–21). Instruction tuning, LoRA fine-tuning, INT8 quantization, a chat REPL with KV cache, and a capstone that ships a specialized assistant.
  • Part VI — Appendices (optional). Backprop derived by hand; RLHF and DPO explained.
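
To give a feel for the Part III block, here is a compressed sketch of its shape. It leans on PyTorch's built-in nn.MultiheadAttention to stay short — in the course you write the attention yourself — and the sizes (d_model=64, n_heads=4) are illustrative, not the course's.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One GPT-style block: attention and an MLP, each behind a residual connection."""
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        T = x.size(1)
        # Causal mask: True marks positions a token is NOT allowed to attend to.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out               # residual around attention
        x = x + self.mlp(self.ln2(x))  # residual around the feed-forward
        return x

x = torch.randn(2, 16, 64)  # (batch, seq_len, d_model)
print(Block()(x).shape)     # torch.Size([2, 16, 64])
```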

3. What you'll have at the end

  • your own BPE tokenizer trained on your data
  • a Transformer block written by you in ~150 lines of PyTorch
  • a small GPT trained from scratch on your laptop
  • the same architecture loaded with real GPT-2 124M weights
  • an instruction-tuned chatbot you can talk to
  • all of it in a single my-llm/ project you own

No proprietary formats, no online dependencies — the artifacts are plain .py files and PyTorch checkpoints you can read, modify, and ship.
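
For example, a saved checkpoint is just a dictionary of tensors. Assuming your training loop saved a plain state dict (the path below is hypothetical), you can inspect every parameter:

```python
import torch

# Hypothetical path -- whatever your training loop saved.
state = torch.load("my-llm/checkpoints/gpt.pt", map_location="cpu")

# A checkpoint is a dict of named tensors; nothing in it is opaque.
for name, tensor in state.items():
    print(f"{name}: {tuple(tensor.shape)}")
```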

4. Prerequisites

  • comfort reading JavaScript or Python — the browser cells use JS, the local project uses Python
  • a laptop with Python 3.11+ (chapter 0 covers setup on Mac, Windows, Linux)
  • patience. You can finish in a weekend if you skip the deeper reading; a week if you take the time to actually play with each cell.

You do not need a math background beyond high-school algebra. The course derives the math inline when it matters and links to the backprop appendix when a full derivation would interrupt the flow.

5. Start

Open chapter 1 — The dumbest model that exists. It's 18 minutes of reading plus a few minutes of running the cells, and it ends with you having shipped the first piece of my-llm/ — a bigram counter, the simplest working language model in existence. Every chapter after builds on that one.
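
For a taste of where chapter 1 lands, here is the bigram idea in a few lines: count which word follows which, then sample in proportion to the counts. The chapter's version will differ in detail; this is just the shape of it.

```python
from collections import Counter
import random

text = "the cat sat on the mat because the cat was tired"
words = text.split()

# Count how often each word follows each other word.
bigrams = Counter(zip(words, words[1:]))

def next_word(word):
    """Sample the next word in proportion to observed bigram counts."""
    candidates = {b: c for (a, b), c in bigrams.items() if a == word}
    options, counts = zip(*candidates.items())
    return random.choices(options, weights=counts)[0]

print(next_word("the"))  # e.g. "cat" or "mat"
```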

If you'd rather see the full picture first, the complete chapter list is here.

Frequently asked questions

What does "from scratch" actually mean here?

Every part of the model — tokenizer, embeddings, attention, training loop — is something you write yourself. No black boxes, no `from transformers import GPT2Model`. You read PyTorch and you write PyTorch, but the architecture is yours.

How long does the whole course take?

Roughly 12 to 16 hours if you run every cell. Most chapters are 10–20 minutes of reading plus optional code time. You can pause between chapters.

Do I need GPU access?

No. The model and dataset are small enough that everything trains on a normal laptop CPU. A consumer GPU makes it faster; nothing more is required.
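
This works because the code can stay device-agnostic, a standard PyTorch pattern rather than anything course-specific:

```python
import torch

# Train on a GPU if one is available, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(8, 8).to(device)  # stand-in for the course's model
```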

Why build from scratch instead of using an existing framework?

To learn. The point isn't to ship a benchmark winner — it's to understand each part well enough that a paper, a config, or a bug stops being opaque. Once you know what's inside, frameworks become tools instead of magic.

What model architecture does the course teach?

A GPT-style decoder-only Transformer. The same architecture as GPT-2 small (124M parameters) — and once you've built it, you can load OpenAI's published weights into your own code.
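
As a sketch of what "loading published weights" means in practice, here is one way to copy a single GPT-2 tensor into your own module. It uses Hugging Face purely as a download source; the from-scratch side (`my_wte`) is hypothetical, and the course's mapping covers every parameter, not just this one.

```python
import torch
from transformers import GPT2LMHeadModel

# Hugging Face serves only as a download source for OpenAI's published weights.
hf_state = GPT2LMHeadModel.from_pretrained("gpt2").state_dict()

# Copy the token-embedding table, which is (vocab_size=50257, d_model=768)
# in GPT-2 124M, into a from-scratch embedding layer.
my_wte = torch.nn.Embedding(50257, 768)
with torch.no_grad():
    my_wte.weight.copy_(hf_state["transformer.wte.weight"])
```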