The loss curve

Chapter 4 · 16 min

Giving meaning to words

Give meaning to tokens. One-hot vectors, embeddings, cosine similarity, skip-gram training — and what "king − man + woman" really shows.

In chapter 3, you trained a tokenizer. It hands you back integers — say, 17 for king and 248 for queen. The model still doesn't know that those two integers are related to each other.

A naive way to feed integers into a model is one-hot encoding: 17 becomes a vector of 50,000 zeros with a single 1 at position 17. It works mechanically, but it's wasteful, and the dot product between any two distinct one-hot vectors is exactly zero. As far as the model can tell, king and queen are no more similar than king and bicycle.
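A quick illustration of the problem, using a toy 5-word vocabulary (the sizes and indices here are made up for the example):

// One-hot: a vector of zeros with a single 1 at the word's index.
const oneHot = (idx, size) =>
  Array.from({ length: size }, (_, i) => (i === idx ? 1 : 0));

const dot = (a, b) => a.reduce((sum, x, i) => sum + x * b[i], 0);

const king = oneHot(1, 5);   // [0, 1, 0, 0, 0]
const queen = oneHot(3, 5);  // [0, 0, 0, 1, 0]

console.log(dot(king, queen)); // 0: no pair of distinct words looks related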

We want each token to live somewhere in a continuous space, and we want the geometry of that space to mean something. The technique that makes this work is word2vec-style skip-gram training: for every word in the corpus, push its vector and its neighbors' vectors toward each other in the dot-product sense. Words used in similar contexts end up with similar vectors — even though we never told the model which words are related.

You're going to build skip-gram embeddings from scratch. Four runnable cells. Then you will add a small embeddings module locally, because a real LLM starts with an embedding table too.

1. Build the (center, context) pairs

Skip-gram trains on pairs: for every position in the corpus, every neighbor inside a fixed window is a positive example. The model's job is to make the dot product of (center vector, context vector) large.

Walk the corpus, and for each center position emit [centerIdx, contextIdx] for each neighbor inside the window.

Code · JavaScript
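A minimal sketch of what that cell might contain, assuming tokens is an array of integer token ids (the names here are illustrative):

// Build (center, context) pairs: every neighbor inside the window is a positive example.
function buildPairs(tokens, window = 2) {
  const pairs = [];
  for (let i = 0; i < tokens.length; i++) {
    const lo = Math.max(0, i - window);
    const hi = Math.min(tokens.length - 1, i + window);
    for (let j = lo; j <= hi; j++) {
      if (j !== i) pairs.push([tokens[i], tokens[j]]); // [centerIdx, contextIdx] as vocabulary ids
    }
  }
  return pairs;
}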

That list is the entire training set for skip-gram. Bigger window = each word sees more context = denser pair graph.

2. One SGD step

For a single (center, context) positive pair, we want the dot product u · v (where u = E[center], v = E[context]) to grow. Push both vectors toward each other.

Setting target = 1 (positive sample) and using the logistic loss, the update simplifies to:

\nabla = \sigma(u \cdot v) - 1, \quad u \leftarrow u - \eta\,\nabla\,v, \quad v \leftarrow v - \eta\,\nabla\,u

Where σ is the sigmoid function. The σ(u · v) − 1 term is just "how far short of 1 are we", and the gradient with respect to each vector is symmetric in the other.
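In case the simplification isn't obvious: the positive-pair loss is L = −log σ(u · v), and with σ′(x) = σ(x)(1 − σ(x)) the chain rule gives

\frac{\partial L}{\partial u} = -\bigl(1 - \sigma(u \cdot v)\bigr)\,v = \nabla\,v, \quad \frac{\partial L}{\partial v} = \nabla\,u

so the plain gradient-descent step with learning rate η is exactly the update above.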

Code · JavaScript
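A sketch of what that cell might look like, assuming E is the embedding matrix from the previous step (an array of vectors; all names illustrative):

// One SGD step on a positive (center, context) pair.
const dot = (a, b) => a.reduce((s, x, i) => s + x * b[i], 0);
const sigmoid = (x) => 1 / (1 + Math.exp(-x));

function skipgramStep(E, center, context, lr = 0.1) {
  const u = E[center].slice();   // copies, so both updates see the pre-step values
  const v = E[context].slice();
  const grad = sigmoid(dot(u, v)) - 1;   // negative: we fall short of target 1
  for (let i = 0; i < u.length; i++) {
    E[center][i] -= lr * grad * v[i];    // pull u toward v
    E[context][i] -= lr * grad * u[i];   // pull v toward u
  }
}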

The dot-product bar should grow after the update. That's the whole signal: take a pair the corpus told us belongs together, and pull their vectors closer in dot-product land.

3. Train and visualize

Now we chain everything: build pairs, init random embeddings, run thousands of steps. Each step picks a random pair from the list. After enough iterations, words that share contexts end up close to each other in 2D.

Code · JavaScript
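Roughly, this cell chains the two previous sketches together (vocabSize, the 2-dimensional embeddings, and the step count are illustrative choices):

// Random init, then thousands of single-pair SGD steps.
function train(tokens, vocabSize, dim = 2, steps = 3000, lr = 0.1) {
  const pairs = buildPairs(tokens, 2);
  const E = Array.from({ length: vocabSize }, () =>
    Array.from({ length: dim }, () => (Math.random() - 0.5) * 0.02)
  );
  for (let s = 0; s < steps; s++) {
    const [center, context] = pairs[Math.floor(Math.random() * pairs.length)];
    skipgramStep(E, center, context, lr);
  }
  return E; // with dim = 2, each row is directly a point on the scatter plot
}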

Look at the scatter. With 3000 iterations on a small corpus, you should see some clustering: pairs like man/woman, boy/girl, king/queen tend to land near each other. The clusters are noisy because the corpus is tiny — real word2vec runs on billions of tokens — but the shape of the result is the right shape. Words used in similar contexts have similar vectors.

4. Cosine similarity

Once you have vectors, "how similar are these two words" reduces to "how aligned are their vectors". The standard measure is cosine similarity:

\cos(a, b) = \frac{a \cdot b}{\|a\|\,\|b\|}

Value 1 means same direction, 0 means orthogonal, −1 means opposite. Magnitudes are usually ignored: it's the direction that carries meaning.

Write the function. The chapter feeds it pre-trained 8-dimensional embeddings (the higher dim gives sharper clusters) and lets you pick a query word — the bar plot shows the closest neighbors by your function's score.

Code · JavaScript
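A minimal version of the function that cell expects:

// Cosine similarity: alignment of directions, ignoring magnitudes.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  const denom = Math.sqrt(na) * Math.sqrt(nb);
  return denom === 0 ? 0 : dot / denom;
}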

Try a few query words. With 4000 iterations, you should see some sensible neighbors — gendered words near each other, royal terms grouping together. Don't expect the magic king − man + woman ≈ queen analogy on this toy corpus — that requires billions of tokens — but the kind of structure that scales up to that is visible here in miniature.

What this fixes (and what it doesn't)

We've solved "all distinct words look unrelated". The model has a way to say "this word is in the same neighborhood as those words", which was missing in chapters 1–2.

We have not solved everything:

  • Static vectors. bank (river) and bank (financial) get the same vector regardless of context. Disambiguation is the whole story of the next several chapters.
  • No word order. Skip-gram averages over neighborhoods; "dog bit man" and "man bit dog" produce identical statistics.
  • No actual prediction yet. We've decorated tokens with vectors. We haven't built a model that uses those vectors to predict the next word.

5. Add embeddings to my-llm/

Create llm/embeddings.py:

"""Tiny embedding helpers before we switch to PyTorch."""
from __future__ import annotations
 
import math
import random
 
 
Vector = list[float]
Matrix = list[Vector]
 
 
# [1]
def init_embeddings(vocab_size: int, dim: int, scale: float = 0.01) -> Matrix:
    return [
        [random.uniform(-scale, scale) for _ in range(dim)]
        for _ in range(vocab_size)
    ]
 
 
# [2]
def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))
 
 
def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))
 
 
def skipgram_step(E: Matrix, center: int, context: int, lr: float = 0.1) -> float:
    u = E[center][:]
    v = E[context][:]
    # [3]
    pred = sigmoid(dot(u, v))
    # [4]
    grad = pred - 1.0
 
    # [5]
    for i in range(len(u)):
        E[center][i] -= lr * grad * v[i]
        E[context][i] -= lr * grad * u[i]
 
    return -math.log(max(pred, 1e-9))
 
 
# [6]
def cosine(a: Vector, b: Vector) -> float:
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return 0.0 if denom == 0 else dot(a, b) / denom

Read this as the smallest possible learning system:

  • [1] init_embeddings creates the embedding table: one row per token, one column per feature. At the start, the numbers are random and meaningless.
  • [2] dot is the model's score for “do these two vectors point in compatible directions?”
  • [3] sigmoid(dot(u, v)) turns that score into a probability-like number between 0 and 1.
  • [4] grad = pred - 1.0 is the mistake on a positive pair. If the model already predicts 0.95, the update is small. If it predicts 0.05, the update is large.
  • The update loop at [5] moves the center vector and the context vector toward each other.
  • [6] cosine is for inspection, not training. It lets you ask which learned vectors point in similar directions.

This is not the final code. Later, PyTorch will own the embedding table and the gradients. But the file establishes the contract: token id in, dense vector out.

Recap

  • One-hot vectors are wasteful and assert that every word is unrelated to every other word.
  • Embeddings are dense low-dimensional vectors, one per token. Words used in similar contexts end up with similar vectors.
  • Skip-gram trains them: pull each (center, context) pair's vectors toward each other in dot-product land, one SGD step at a time.
  • Cosine similarity is the standard "how alike are these words" metric.
  • Your local project now has llm/embeddings.py, the first file with learned parameters.
  • Higher dimensions let unrelated semantic axes (gender, age, formality, royal-ness, ...) coexist instead of stepping on each other.
  • Embeddings alone don't predict anything. The next chapters wire them into something that does.

Going further

Next up: a neuron that learns. Embeddings give us inputs. Now we need a function that turns inputs into outputs and gets better at it.