Deconstructing ViTs from Scratch
Explore the mathematics of patch embeddings, positional encodings, and pure-attention image classification, implemented from scratch in plain PyTorch.
Part 1
The Math of Patch Embeddings
Breaking down image-to-sequence conversion, CLS token classification, learnable positional encodings, and why convolutions aren't necessary.
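The image-to-sequence conversion described above can be sketched numerically. This is a minimal illustration (NumPy stands in for PyTorch, and all weights are random placeholders, not trained parameters): a 32×32 RGB image is cut into 8×8 patches, each patch is flattened and linearly projected to the model dimension, and a CLS token plus positional encodings complete the sequence. The specific sizes are assumptions for the example, not values from the series.

```python
import numpy as np

# Hypothetical sizes: 32x32 RGB image, 8x8 patches, model dimension 64.
H = W = 32; C = 3; P = 8; D = 64
img = np.random.rand(H, W, C)

# Cut the image into a grid of patches, then flatten each patch.
patches = img.reshape(H // P, P, W // P, P, C)   # (4, 8, 4, 8, 3)
patches = patches.transpose(0, 2, 1, 3, 4)       # (4, 4, 8, 8, 3) patch grid
patches = patches.reshape(-1, P * P * C)         # (16, 192) flattened patches

# Linear projection to model dimension D (random weights for illustration).
W_proj = np.random.rand(P * P * C, D)
tokens = patches @ W_proj                        # (16, 64)

# Prepend a CLS token and add positional encodings
# (both would be learnable parameters in a real ViT).
cls = np.zeros((1, D))
pos = np.random.rand(tokens.shape[0] + 1, D)     # one encoding per token
seq = np.concatenate([cls, tokens]) + pos        # (17, 64)
```

The resulting 17-token sequence is what the Transformer encoder consumes; the CLS token's final state is what gets classified.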
Part 2
PyTorch Implementation
Building the PatchEmbedding, MultiHeadSelfAttention, TransformerBlock, and VisionTransformer modules — entirely from scratch, without timm.
View Code on GitHub
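Of the modules named above, multi-head self-attention is the mathematical core. The sketch below (NumPy rather than PyTorch, random weights, shapes chosen for illustration; it is not the series' implementation) shows the essential steps: project tokens to per-head queries, keys, and values, apply scaled dot-product attention per head, then concatenate heads and project back.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, Wq, Wk, Wv, Wo, n_heads):
    """x: (N, D) token sequence; Wq/Wk/Wv/Wo: (D, D) projection weights."""
    N, D = x.shape
    d = D // n_heads                                        # per-head dimension
    # Project and split into heads: (N, D) -> (n_heads, N, d)
    q = (x @ Wq).reshape(N, n_heads, d).transpose(1, 0, 2)
    k = (x @ Wk).reshape(N, n_heads, d).transpose(1, 0, 2)
    v = (x @ Wv).reshape(N, n_heads, d).transpose(1, 0, 2)
    # Scaled dot-product attention, independently per head.
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d))   # (n_heads, N, N)
    out = (attn @ v).transpose(1, 0, 2).reshape(N, D)       # concatenate heads
    return out @ Wo                                         # final projection

rng = np.random.default_rng(0)
D, N = 64, 17                                  # e.g. 16 patch tokens + CLS
x = rng.standard_normal((N, D))
Ws = [rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(4)]
y = multi_head_self_attention(x, *Ws, n_heads=4)           # y.shape == (17, 64)
```

In the PyTorch version the three projections are typically fused into one `nn.Linear`, but the arithmetic is the same.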
Part 3
Patches vs Pixels
Comparing ViT against a CNN baseline on CIFAR-10, visualizing attention maps, and confirming that Vision Transformers need large datasets to outperform convolutional baselines.
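One common recipe for the attention-map visualization mentioned above: take the CLS token's attention row from the last block, drop the CLS-to-CLS entry, and reshape the remaining weights back into the patch grid. A minimal sketch with a random attention matrix standing in for real model output (the 16-patch grid is an assumption for illustration):

```python
import numpy as np

# Stand-in for a real attention matrix over [CLS] + 16 patch tokens.
n_patches = 16
attn = np.random.rand(1 + n_patches, 1 + n_patches)
attn = attn / attn.sum(axis=-1, keepdims=True)   # rows sum to 1, like softmax

cls_attn = attn[0, 1:]                # CLS token's attention to each patch
attn_map = cls_attn.reshape(4, 4)     # back to the spatial patch grid
# Upsample attn_map to image resolution and overlay it as a heatmap.
```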