LoRA fine-tuning, built from scratch
Implement Low-Rank Adaptation (LoRA) from scratch and fine-tune a GPT-2 model. Math, code, and a working PyTorch example.
LoRA (Low-Rank Adaptation) is the most widely used parameter-efficient fine-tuning method in practice. It's also small enough to implement from scratch in ~30 lines. This guide walks through the math, the code, and where to use it.
1. The problem with full fine-tuning
Full fine-tuning updates every weight in the model. For GPT-2 small, that's 124M trainable parameters; for a 70B-parameter model, it's untenable without serious infrastructure. Worse, you end up with a separate full-size checkpoint for every fine-tuned variant.
LoRA's premise: fine-tuning rarely needs the full capacity of the model. The update — the difference between the base and the fine-tuned weights — is usually low-rank.
2. The math
For a frozen weight matrix W ∈ ℝ^(d × k), LoRA introduces two small matrices:
A ∈ ℝ^(r × k)
B ∈ ℝ^(d × r)
where r << min(d, k). The fine-tuned forward pass becomes:
y = (W + B·A) · x
W is frozen — only A and B are trained. At inference, you can either keep them separate (good for swapping adapters) or merge them: W' = W + B·A.
The parameter count drops from d·k (full fine-tuning) to r·(d + k) (LoRA). For a square matrix with d = k = 768 and r = 8, that's 589,824 → 12,288 trainable parameters, a 48× reduction; the factor is d/(2r) for square matrices, so it grows with model width and passes 100× once d exceeds 1,600.
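To see the merge identity concretely, here is a minimal sketch (dimensions are illustrative) checking that keeping the adapter separate and folding it into the base weight give the same output:

import torch

d, k, r = 768, 768, 8
W = torch.randn(d, k)   # frozen base weight
A = torch.randn(r, k)   # trained
B = torch.randn(d, r)   # trained
x = torch.randn(k)

y_separate = W @ x + B @ (A @ x)   # adapter kept separate
y_merged = (W + B @ A) @ x         # W' = W + B·A
assert torch.allclose(y_separate, y_merged, atol=1e-4)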
3. Implementing it
LoRA in PyTorch is just a wrapper around nn.Linear:
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base_linear, r=8, alpha=16):
        super().__init__()
        self.base = base_linear  # frozen
        d, k = base_linear.out_features, base_linear.in_features
        self.A = nn.Parameter(torch.zeros(r, k))
        self.B = nn.Parameter(torch.zeros(d, r))
        nn.init.kaiming_uniform_(self.A)
        # B starts at zero so the initial output equals the base
        self.scale = alpha / r
        for p in self.base.parameters():
            p.requires_grad = False

    def forward(self, x):
        # y = base(x) + (alpha/r) * B·A·x, without materializing B·A
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

That's the whole thing. Chapter 18 — Fine-tuning with LoRA builds this incrementally, with a working fine-tuning run on GPT-2 small.
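A quick sanity check (a minimal sketch; the 768-wide layer stands in for any of GPT-2 small's projections): because B starts at zero, the wrapped layer reproduces the base layer exactly at initialization, and only the adapter's 12,288 parameters are trainable.

base = nn.Linear(768, 768)
lora = LoRALinear(base, r=8, alpha=16)

x = torch.randn(4, 768)
assert torch.allclose(lora(x), base(x))  # B = 0 at init, so outputs match

trainable = sum(p.numel() for p in lora.parameters() if p.requires_grad)
print(trainable)  # 12288 = 8*768 + 768*8, vs 590,592 frozen in the base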
4. Which layers should you LoRA?
In practice, LoRA is applied to the attention projections (W_Q, W_V, sometimes W_K and W_O). The FFN layers are sometimes included for harder tasks.
Empirically, LoRA on attention is enough for instruction tuning and most domain adaptation. Including the FFN layers substantially increases the adapter's parameter count for marginal gains on common tasks.
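As a sketch of how you might wire this in, the helper below wraps matching nn.Linear projections in place. The q_proj/v_proj names are an assumption (they follow LLaMA-style naming; Hugging Face's GPT-2 fuses Q, K, V into a single Conv1D called c_attn, which needs slightly different handling):

def apply_lora(model, target_names=("q_proj", "v_proj"), r=8, alpha=16):
    # Replace each matching nn.Linear child with a LoRALinear wrapper.
    # list(...) so we can mutate the tree while walking it.
    for parent in list(model.modules()):
        for name, child in list(parent.named_children()):
            if name in target_names and isinstance(child, nn.Linear):
                setattr(parent, name, LoRALinear(child, r=r, alpha=alpha))
    return model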
5. Saving and loading adapters
A trained LoRA produces a small file — for GPT-2 small with rank 8, the adapter is a few hundred kilobytes. You can ship many adapters alongside one base model:
gpt2-small.bin # 500 MB base, shared
lora-finance.pt # 200 KB adapter
lora-medical.pt # 200 KB adapter
lora-customer-support.pt # 200 KB adapter
At runtime, load the base once and swap adapters cheaply.
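A minimal sketch of the save/load side, assuming every adapted layer is a LoRALinear from section 3 (so adapter parameter names end in ".A" or ".B"):

def adapter_state(model):
    # Keep only the adapter tensors; everything else is the shared base.
    return {k: v for k, v in model.state_dict().items()
            if k.endswith(".A") or k.endswith(".B")}

torch.save(adapter_state(model), "lora-finance.pt")

# Later: load the base model once, then swap adapters in place.
# strict=False because we are loading only the adapter subset.
model.load_state_dict(torch.load("lora-medical.pt"), strict=False)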
6. Where this fits in the course
- Chapter 17 — Instruction tuning sets up the SFT recipe (chat template, loss masking, dataset format) that LoRA plugs into.
- Chapter 18 — Fine-tuning with LoRA walks through the implementation and runs a fine-tune.
- Chapter 21 — Capstone uses both SFT and (optionally) LoRA to ship a domain-specific model.
7. Going further
After basic LoRA you'll see variants in the wild:
- QLoRA: LoRA over a 4-bit quantized base model. Lets you fine-tune large models on consumer GPUs.
- DoRA: decomposes each weight into a magnitude and a direction, applying the low-rank update to the direction.
- Rank schedules: changing r over training.
The base LoRA recipe is the foundation for all of them.
Frequently asked questions
What is LoRA?
Low-Rank Adaptation. Instead of updating every weight during fine-tuning, you add two small matrices that approximate the update — far fewer trainable parameters, almost the same result. The base model stays frozen.
How does LoRA actually work?
For each weight matrix W you want to fine-tune, you add a low-rank delta — two small matrices A and B such that ΔW = B·A. During the forward pass, the layer computes (W + B·A) · x. Only A and B are trained; W is frozen.
How many parameters does LoRA add?
For a typical rank-8 LoRA on GPT-2 small (124M parameters), around 100k–300k extra trainable parameters versus 124M for full fine-tuning — roughly 0.1–0.2%. The savings get more dramatic for larger models.
Why is the rank important?
The rank r controls how much capacity the LoRA adapter has. Too low and the fine-tune underfits. Too high and you lose the parameter savings. Common values are 4, 8, 16. Higher ranks help with more demanding tasks.
When should I use LoRA instead of full fine-tuning?
When you want to specialize a base model without paying for full retraining — when the base is large, when you'll have many different adapters, or when VRAM is the bottleneck. For a small model on a small dataset, full fine-tuning is often just as good.