Transformers from Scratch
Explore the mathematics of multi-head self-attention, build an Encoder-Decoder Transformer in pure PyTorch, and visualize how cross-attention routes information between sequences.
Part 1
The Math of Self-Attention
Deconstructing the mathematics of Queries, Keys, Values, Scaled Dot-Product Attention, and sinusoidal (sine/cosine) Positional Encodings.
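As a preview of Part 1, the two core formulas can be sketched directly in PyTorch. This is a minimal illustration, not the series' actual implementation; the function names are my own. It computes Attention(Q, K, V) = softmax(QKᵀ / √d_k)·V and the standard sinusoidal encodings PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...).

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, seq_len, d_k). Implements softmax(QK^T / sqrt(d_k)) V.
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        # Disallowed positions get -inf, so softmax assigns them zero weight.
        scores = scores.masked_fill(~mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v, weights

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe
```

Note that the attention weights form a proper probability distribution over the keys: each row sums to 1.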
Part 2
PyTorch Implementation
Translating the self-attention equations into working PyTorch modules, handling causal masks, and assembling the full Transformer from scratch.
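The causal mask mentioned here can be sketched in a few lines; this is an illustrative snippet under my own naming, not the repository's code. A lower-triangular boolean mask lets position i attend only to positions j ≤ i, and masked-out scores are set to -inf before the softmax so they receive zero attention weight.

```python
import torch

def causal_mask(seq_len):
    # True where attention is allowed: position i may attend to j <= i only.
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Applying the mask to raw attention scores:
scores = torch.randn(4, 4)
masked = scores.masked_fill(~causal_mask(4), float("-inf"))
weights = torch.softmax(masked, dim=-1)
# The first row can only attend to itself, so its weight on position 0 is 1.
```

In the decoder this mask is what enforces autoregressive generation: during training the whole target sequence is fed in at once, but no position can "peek" at future tokens.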
Part 3
Sequence Reversal & Attention
Benchmarking the model on a sequence-reversal task to observe rapid loss convergence, and inspecting the dynamic routing visible in the cross-attention weights.
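A sequence-reversal dataset is trivial to generate on the fly, which is part of what makes it a good sanity-check task: the source is a batch of random tokens and the target is the same batch reversed. The sketch below uses hypothetical names of my own; the series' actual data pipeline may differ.

```python
import torch

def make_reversal_batch(batch_size, seq_len, vocab_size):
    # Source: random token ids (1..vocab_size-1, reserving 0 for padding/BOS).
    src = torch.randint(1, vocab_size, (batch_size, seq_len))
    # Target: the source sequence reversed along the time dimension.
    tgt = torch.flip(src, dims=[1])
    return src, tgt
```

Because each output token depends on exactly one input token (the mirrored position), a well-trained model's cross-attention weights should approximate an anti-diagonal matrix, which is what makes the attention maps in this part easy to interpret.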