
Deconstructing ViTs from Scratch

Explore the mathematics of patch embeddings, positional encodings, and pure-attention image classification, built from scratch in pure PyTorch.

Part 1

The Math of Patch Embeddings

Breaking down image-to-sequence conversion, CLS token classification, learnable positional encodings, and why convolutions aren't necessary.
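The pipeline described above can be sketched in a few lines of tensor code. This is a minimal illustration, not the article's exact implementation; the 32×32 images, 4×4 patches, and 128-dimensional embeddings are hypothetical choices for demonstration.

```python
import torch

B, C, H, W = 2, 3, 32, 32   # batch, channels, height, width
P = 4                       # patch size
N = (H // P) * (W // P)     # number of patches: 64
D = 128                     # embedding dimension

x = torch.randn(B, C, H, W)

# 1. Split the image into non-overlapping P x P patches and flatten each one.
patches = x.unfold(2, P, P).unfold(3, P, P)          # (B, C, H/P, W/P, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, N, C * P * P)

# 2. Linearly project each flattened patch to D dimensions.
proj = torch.nn.Linear(C * P * P, D)
tokens = proj(patches)                               # (B, N, D)

# 3. Prepend a learnable CLS token and add learnable positional encodings.
cls = torch.nn.Parameter(torch.zeros(1, 1, D)).expand(B, -1, -1)
pos = torch.nn.Parameter(torch.zeros(1, N + 1, D))
seq = torch.cat([cls, tokens], dim=1) + pos          # (B, N + 1, D)

print(seq.shape)  # torch.Size([2, 65, 128])
```

No convolution appears anywhere: a reshape plus a single linear layer is enough to turn an image into a token sequence.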

Part 2

PyTorch Implementation

Building PatchEmbedding, MultiHeadSelfAttention, TransformerBlocks, and VisionTransformer — entirely from scratch without timm.
View Code on GitHub
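As a flavor of what "from scratch without timm" means, here is a hedged sketch of a multi-head self-attention module of the kind this part builds. Class and argument names are illustrative and may differ from the repository's.

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Scaled dot-product attention over a token sequence, split into heads."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)  # fused Q, K, V projection
        self.out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, D = x.shape
        # Project, then split into (B, heads, N, head_dim) for Q, K, V.
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)
        # Scaled dot-product attention per head.
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        attn = attn.softmax(dim=-1)
        # Merge heads back into a single embedding dimension.
        merged = (attn @ v).transpose(1, 2).reshape(B, N, D)
        return self.out(merged)

x = torch.randn(2, 65, 128)          # e.g. 64 patch tokens + 1 CLS token
mhsa = MultiHeadSelfAttention(dim=128, num_heads=8)
print(mhsa(x).shape)  # torch.Size([2, 65, 128])
```

Stacking this with a LayerNorm/MLP residual pair yields a transformer block, and a stack of those plus the patch embedding yields the full VisionTransformer.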

Part 3

Patches vs Pixels

ViT vs CNN on CIFAR-10, visualizing attention maps, and confirming that Vision Transformers need large datasets to outperform convolutional baselines.
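The attention-map visualizations mentioned above boil down to reading out how much the CLS token attends to each patch and reshaping that vector onto the patch grid. A minimal sketch, using random stand-in weights where a real pipeline would use a forward hook on an attention layer:

```python
import torch

B, heads, N = 1, 8, 64                     # 64 patches from an 8 x 8 grid
# Stand-in attention weights of shape (B, heads, tokens, tokens),
# where token 0 is the CLS token. Normalize rows like a softmax output.
attn = torch.rand(B, heads, N + 1, N + 1)
attn = attn / attn.sum(dim=-1, keepdim=True)

# How strongly the CLS token attends to each image patch, averaged over heads.
cls_to_patches = attn[:, :, 0, 1:].mean(dim=1)  # (B, N)
attn_map = cls_to_patches.reshape(B, 8, 8)      # back onto the patch grid

print(attn_map.shape)  # torch.Size([1, 8, 8])
```

Upsampling this 8×8 map to the input resolution and overlaying it on the image gives the familiar "where the ViT looked" heatmaps.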