Adam
The most popular optimizer in modern deep learning. Maintains per-parameter running averages of the gradient (m) and the squared gradient (v), and normalizes each step by √v.
Combines momentum (the first moment) with per-dimension adaptive scaling (the second moment), and bias-corrects both. The default optimizer for transformers. AdamW is Adam with weight decay decoupled from the gradient update and applied directly to the weights.
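A minimal NumPy sketch of one Adam/AdamW step, with the usual default hyperparameters (lr, beta1, beta2, eps are illustrative names, not taken from this card):

```python
import numpy as np

def adam_step(param, grad, m, v, t,
              lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.0):
    """One Adam/AdamW step for a single parameter tensor (illustrative sketch).

    t is the 1-indexed step count; m and v start as zero arrays.
    """
    # Update biased running averages of the gradient and squared gradient.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias-correct both moments (matters early on, since m and v start at zero).
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Momentum direction, scaled per-dimension by 1/sqrt(v_hat).
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    # AdamW: decoupled weight decay, applied to the weights directly,
    # not folded into the gradient.
    if weight_decay:
        param = param - lr * weight_decay * param
    return param, m, v
```

With weight_decay=0.0 this is plain Adam; a nonzero value gives the AdamW-style decoupled decay described above.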