
Transformers from Scratch

Explore the mathematics of multi-head self-attention, build an encoder-decoder Transformer in pure PyTorch, and visualize how cross-attention routes information between source and target sequences.

Part 1

The Math of Self-Attention

Deconstructing the mathematics of Queries, Keys, Values, Scaled Dot-Product Attention, and sine/cosine Positional Encodings.
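The two formulas this part deconstructs can be sketched directly. Below is a minimal NumPy version (NumPy rather than PyTorch, to keep it dependency-light) of scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V, and the sine/cosine positional encodings; the function names and toy shapes are illustrative, not taken from the project's code.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V, weights

def positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out, w = scaled_dot_product_attention(Q, K, V)
# Each row of w is a probability distribution over the 4 key positions.
```

Note the 1/√d_k scaling: without it, dot products grow with dimension and push the softmax into near-one-hot saturation, which hurts gradient flow.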

Part 2

PyTorch Implementation

Translating the self-attention equations into executable PyTorch modules, managing causal masks, and assembling the full Transformer from scratch.
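The causal-mask bookkeeping mentioned here is the standard decoder trick: block every position from attending to positions after it. A small framework-agnostic sketch (NumPy for brevity; in PyTorch the same logic uses `torch.triu` and `masked_fill`):

```python
import numpy as np

def causal_mask(seq_len):
    # True strictly above the diagonal marks "future" positions to block.
    return np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

def masked_softmax(scores, mask):
    # Blocked entries are set to -inf before softmax, so they get zero weight.
    scores = np.where(mask, -np.inf, scores)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return w / w.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))                 # uniform scores, for illustration
w = masked_softmax(scores, causal_mask(4))
# Row t now spreads its weight only over positions 0..t.
```

Applying the mask as -inf *before* the softmax (rather than zeroing weights after) keeps each row a valid probability distribution over the visible positions.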

Part 3

Sequence Reversal & Attention

Benchmarking the model on a sequence-reversal task to observe rapid loss convergence, and examining the dynamic routing visible in the cross-attention weights.
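The reversal benchmark is easy to reproduce: inputs are random token sequences and targets are the same sequences reversed, so decoder step t must route to encoder position L−1−t. A data-generation sketch (helper name and shapes are illustrative, not from the project's code):

```python
import numpy as np

def reversal_batch(batch_size, seq_len, vocab_size, seed=0):
    """Random token sequences paired with their reversals."""
    rng = np.random.default_rng(seed)
    src = rng.integers(0, vocab_size, size=(batch_size, seq_len))
    tgt = src[:, ::-1].copy()
    return src, tgt

src, tgt = reversal_batch(batch_size=2, seq_len=5, vocab_size=10)

# For a well-trained model, the cross-attention weights should approach a
# flipped identity: decoder step t attends to encoder position seq_len-1-t.
ideal_cross_attention = np.fliplr(np.eye(5))
```

That anti-diagonal pattern is what makes this task a good visualization target: the "correct" routing is known in closed form, so the learned cross-attention maps can be compared against it directly.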