Deconstructing ViTs from Scratch

Part 1: The Math of Patch Embeddings

Introduction

In 2020, Dosovitskiy et al. asked whether a pure Transformer, applied directly to sequences of image patches, could match convolutional neural networks at image classification. The resulting Vision Transformer (ViT) showed that, when pre-trained on enough data, the same architecture behind GPT and BERT can match or exceed state-of-the-art CNNs on image classification.

The core idea: treat an image as a sequence. Convert pixels into tokens, then apply self-attention -- no convolutions, no pooling, and almost no hard-coded spatial assumptions beyond the patch grid itself. This post derives the math: how images become sequences, how positional information gets encoded, and how a single learnable token aggregates global information for classification.

Image as Sequence

A Transformer encoder expects a sequence of token embeddings $\{z_i\}_{i=1}^{N}$, each of dimension $D$. In NLP, each $z_i$ is a word embedding. For vision, we need to turn a 2D image into a 1D sequence.

Given an image $\mathbf{x} \in \mathbb{R}^{H \times W \times C}$, ViT divides it into a grid of non-overlapping patches:

$$ \mathbf{x}_p^{(i)} \in \mathbb{R}^{P \times P \times C}, \quad i = 1, 2, \ldots, N, \quad N = \frac{HW}{P^2} $$

where $P$ is the patch size. For a $32 \times 32$ CIFAR-10 image with $P = 4$, we get $N = (32/4)^2 = 64$ patches, each containing $4 \times 4 \times 3 = 48$ values.
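The patch arithmetic is easy to check by hand, but a tiny helper makes it simple to try other configurations. This is a minimal sketch; the function name `patch_stats` is ours, not part of any library.

```python
def patch_stats(height, width, patch_size, channels):
    """Return (number of patches N, values per flattened patch P*P*C)."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by the patch size")
    num_patches = (height // patch_size) * (width // patch_size)
    patch_dim = patch_size * patch_size * channels
    return num_patches, patch_dim

# CIFAR-10 example from the text: 32x32 RGB image, P = 4
print(patch_stats(32, 32, 4, 3))     # (64, 48)
# A 224x224 RGB image with P = 16
print(patch_stats(224, 224, 16, 3))  # (196, 768)
```

Note how quickly $N$ grows as $P$ shrinks: halving the patch size quadruples the sequence length.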

Each patch is flattened into a vector and linearly projected to dimension $D$:

$$ \mathbf{z}_i^{(0)} = \mathbf{x}_p^{(i)} \mathbf{E}, \quad \mathbf{E} \in \mathbb{R}^{(P^2 C) \times D} $$
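To make the flatten-then-project step concrete, here is a small NumPy sketch. The helper `patchify` and the random matrix `E` are illustrative assumptions: in a real model $\mathbf{E}$ is learned, and $D = 128$ is just the value used later in this post.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into N flattened patches of length P*P*C."""
    H, W, C = image.shape
    P = patch_size
    # (H/P, P, W/P, P, C) -> (H/P, W/P, P, P, C) -> (N, P*P*C)
    grid = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    return grid.reshape(-1, P * P * C)

rng = np.random.default_rng(0)
P, D = 4, 128
x = rng.standard_normal((32, 32, 3))             # stand-in for a CIFAR-10 image
E = rng.standard_normal((P * P * 3, D)) * 0.02   # random stand-in for the learned projection

patches = patchify(x, P)   # one row per patch, in raster order
z0 = patches @ E           # one D-dimensional token per patch
print(patches.shape, z0.shape)   # (64, 48) (64, 128)
```

Each row of `patches` is one $\mathbf{x}_p^{(i)}$ flattened in (row, column, channel) order, so the matrix product applies the same $\mathbf{E}$ to every patch.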

The Patch Embedding

In practice, the flatten-then-project step is implemented as a single Conv2d with kernel_size=P and stride=P.

Because stride equals kernel size, the convolution extracts non-overlapping $P \times P$ patches and projects them to $D$ channels in one shot. The output tensor has shape $(B, D, H/P, W/P)$, which gets flattened spatially to $(B, N, D)$.

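This is mathematically equivalent to the flatten-project formulation: the Conv2d weights, reshaped to $(P^2 C) \times D$, are the linear projection $\mathbf{E}$ (Conv2d also adds a bias term by default). The convolutional form avoids building the intermediate patch tensor explicitly and maps onto optimized convolution kernels, so it is usually the more efficient choice.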
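The equivalence can be checked numerically. The sketch below, assuming PyTorch, runs the strided convolution and then repeats the computation as an explicit unfold plus matrix multiply using the same weights. One detail: `F.unfold` flattens each patch in (channel, row, column) order rather than the (row, column, channel) order of the math above, which only permutes the rows of $\mathbf{E}$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
B, C, H, W, P, D = 2, 3, 32, 32, 4, 128
x = torch.randn(B, C, H, W)

conv = nn.Conv2d(C, D, kernel_size=P, stride=P)

# Path 1: strided convolution, then flatten the spatial grid.
out_conv = conv(x)                                  # (B, D, H/P, W/P)
tokens_conv = out_conv.flatten(2).transpose(1, 2)   # (B, N, D)

# Path 2: explicit flatten-then-project using the same weights.
patches = F.unfold(x, kernel_size=P, stride=P)      # (B, C*P*P, N)
patches = patches.transpose(1, 2)                   # (B, N, C*P*P)
E = conv.weight.reshape(D, -1).t()                  # (C*P*P, D)
tokens_linear = patches @ E + conv.bias             # (B, N, D)

print(tokens_conv.shape)                            # torch.Size([2, 64, 128])
print(torch.allclose(tokens_conv, tokens_linear, atol=1e-5))
```

The two paths should agree up to floating-point rounding.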

The [CLS] Token

Borrowed from BERT: a special [CLS] token is prepended to the input sequence, and its final representation serves as the aggregate embedding for classification.

A learnable vector $\mathbf{z}_{\text{cls}} \in \mathbb{R}^D$ is prepended to the patch sequence:

$$ \mathbf{z}^{(0)} = [\mathbf{z}_{\text{cls}}; \; \mathbf{z}_1^{(0)}; \; \mathbf{z}_2^{(0)}; \; \ldots; \; \mathbf{z}_N^{(0)}] $$

The sequence now has length $N + 1$. Through $L$ Transformer layers, the CLS token attends to every patch token and accumulates global information. After the final layer, only its output $\mathbf{z}_{\text{cls}}^{(L)}$ feeds into the classification head:

$$ \hat{y} = \text{MLP}_{\text{head}}(\text{LayerNorm}(\mathbf{z}_{\text{cls}}^{(L)})) $$

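You could also average all patch tokens (global average pooling) and feed the result to the head. Dosovitskiy et al. report that both options perform similarly, although they need different learning rates. The CLS token has a useful interpretation, though: it acts as a learnable query -- "what class is this image?" -- and collects evidence from every patch via attention.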
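In code, both readouts are one line each. The following is a rough PyTorch sketch under several assumptions: the encoder is replaced by an identity placeholder, the head is a single `nn.Linear` standing in for $\text{MLP}_{\text{head}}$, and the zero initialization of the CLS token is just one common choice.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, N, D, num_classes = 2, 64, 128, 10

patch_tokens = torch.randn(B, N, D)             # stand-in for the patch embeddings
cls_token = nn.Parameter(torch.zeros(1, 1, D))  # one learnable vector, shared across the batch

# Prepend the CLS token to every sequence in the batch.
z0 = torch.cat([cls_token.expand(B, -1, -1), patch_tokens], dim=1)  # (B, N+1, D)

# Placeholder for the L Transformer layers; identity here, so only shapes are meaningful.
zL = z0

norm = nn.LayerNorm(D)
head = nn.Linear(D, num_classes)   # a single linear layer standing in for MLP_head

logits_cls = head(norm(zL[:, 0]))               # CLS readout
logits_gap = head(norm(zL[:, 1:].mean(dim=1)))  # global-average-pooling alternative

print(z0.shape, logits_cls.shape, logits_gap.shape)
```

Either way the head sees a single $D$-dimensional vector per image; the difference is whether that vector is learned through attention or computed as a fixed average.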

Positional Embeddings

Self-attention by itself has no notion of spatial order. Without positional information the encoder is permutation-equivariant: shuffling the input patches shuffles the output tokens in the same way, and the CLS output does not change at all. To break this symmetry, we add learnable positional embeddings:

$$ \mathbf{z}^{(0)} = [\mathbf{z}_{\text{cls}} + \mathbf{p}_0; \; \mathbf{z}_1^{(0)} + \mathbf{p}_1; \; \ldots; \; \mathbf{z}_N^{(0)} + \mathbf{p}_N] $$

where $\mathbf{p}_i \in \mathbb{R}^D$ are learnable parameters. Unlike the fixed sinusoidal encodings from "Attention Is All You Need," ViT uses fully learnable 1D positional embeddings.
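The symmetry, and how positional embeddings break it, can be demonstrated on a toy example. This NumPy sketch uses a single attention head with random weights and random "position embeddings"; all names and sizes are illustrative, and nothing here is trained.

```python
import numpy as np

def self_attention(z, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a (T, D) sequence."""
    q, k, v = z @ Wq, z @ Wk, z @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
T, D = 8, 16
z = rng.standard_normal((T, D))
Wq, Wk, Wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))
pos = rng.standard_normal((T, D))      # random stand-in for learned position embeddings
perm = np.roll(np.arange(T), 1)        # a fixed, non-trivial shuffle of the tokens

# Without positions: shuffling the inputs just shuffles the outputs.
no_pos_equivariant = np.allclose(
    self_attention(z[perm], Wq, Wk, Wv), self_attention(z, Wq, Wk, Wv)[perm]
)

# With positions: slot i always receives p_i, so the shuffle now changes the result.
with_pos_equivariant = np.allclose(
    self_attention(z[perm] + pos, Wq, Wk, Wv), self_attention(z + pos, Wq, Wk, Wv)[perm]
)
print(no_pos_equivariant, with_pos_equivariant)   # True False
```

Because $\mathbf{p}_i$ is tied to the slot rather than to the patch content, the same patch produces a different token depending on where it sits in the sequence.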

What Do They Learn?

Dosovitskiy et al. showed that the cosine similarity between two learned position embeddings tracks how close the corresponding patches are on the 2D grid. Nearby patches end up with similar embeddings, distant patches get dissimilar ones, and patches in the same row or column are more similar to each other. The model recovers 2D spatial structure from 1D position indices, without being told the image is 2D.
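If you want to reproduce this kind of analysis on your own model, the computation is short. The sketch below is a hypothetical helper, not the authors' code, and it runs on a random table, so the maps show no spatial pattern; with trained embeddings you would expect each map to peak around its own grid location.

```python
import numpy as np

def position_similarity(pos_embed, grid_h, grid_w):
    """Cosine similarity of every patch position to every other, as 2D maps.

    pos_embed: (N, D) patch position embeddings (CLS row already removed).
    Returns an array of shape (N, grid_h, grid_w); map i shows how similar
    position i is to each location in the patch grid.
    """
    unit = pos_embed / np.linalg.norm(pos_embed, axis=1, keepdims=True)
    sim = unit @ unit.T                       # (N, N), entries in [-1, 1]
    return sim.reshape(-1, grid_h, grid_w)

# Random stand-in for a trained table; a real analysis would load learned weights.
rng = np.random.default_rng(0)
pos_embed = rng.standard_normal((64, 128))
maps = position_similarity(pos_embed, 8, 8)
print(maps.shape)       # (64, 8, 8)
print(maps[9, 1, 1])    # close to 1.0: position 9 sits at row 1, col 1 and matches itself
```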

1D vs 2D Positional Embeddings

ViT uses 1D positional embeddings (one index per patch) rather than 2D (row, column pairs). The authors found no significant benefit from 2D embeddings -- the model can apparently infer the 2D grid structure from 1D positions and the image content alone.
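For comparison, here is what the two options look like as lookup tables. The 2D construction shown (separate row and column tables of size $D/2$, concatenated per patch) is one common way to build such embeddings and is an assumption of this sketch; the tables are random stand-ins for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
grid_h, grid_w, D = 8, 8, 128

# 1D variant: one D-dimensional vector per patch index, in raster order.
pos_1d = rng.standard_normal((grid_h * grid_w, D))

# 2D variant: separate row and column tables of size D/2 each, concatenated per patch.
row_table = rng.standard_normal((grid_h, D // 2))
col_table = rng.standard_normal((grid_w, D // 2))
rows, cols = np.divmod(np.arange(grid_h * grid_w), grid_w)   # patch index -> (row, col)
pos_2d = np.concatenate([row_table[rows], col_table[cols]], axis=1)

print(pos_1d.shape, pos_2d.shape)                     # (64, 128) (64, 128)
print(pos_1d.size, row_table.size + col_table.size)   # 8192 1024
```

Both variants hand the encoder one $D$-dimensional vector per patch. The 2D version bakes the grid structure in and uses fewer parameters, while the 1D version has to discover that structure from data.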

The Complete Embedding Pipeline

Putting it together:

  1. Extract patches: Split the image into $N = HW/P^2$ non-overlapping $P \times P$ patches
  2. Project: Each flattened patch is linearly projected to dimension $D$ (in practice, steps 1 and 2 are a single Conv2d with kernel_size=P and stride=P)
  3. Flatten: Spatial dimensions collapse to a sequence of length $N$
  4. Prepend CLS: A learnable CLS token is prepended, giving sequence length $N+1$
  5. Add positions: Learnable positional embeddings are added element-wise

The output is a tensor of shape $(B, N+1, D)$ -- a batch of sequences ready for the Transformer encoder. For our CIFAR-10 setup ($32 \times 32$ images, $P=4$, $D=128$), that is $(B, 65, 128)$: 65 tokens of 128 dimensions each.
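The five steps fit in a small module. This is a minimal PyTorch sketch of the pipeline, assuming square images; the class name, argument names, and initialization values (zeros for the CLS token, small random values for the positions) are our choices for illustration.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Image -> (B, N+1, D) token sequence: patches, CLS token, positions."""

    def __init__(self, img_size=32, patch_size=4, in_chans=3, embed_dim=128):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Steps 1-2: extract and project patches with one strided convolution.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        # Step 4: learnable CLS token.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # Step 5: one learnable position vector per token, including CLS.
        self.pos_embed = nn.Parameter(torch.randn(1, num_patches + 1, embed_dim) * 0.02)

    def forward(self, x):
        B = x.shape[0]
        x = self.proj(x)                    # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)    # Step 3: (B, N, D)
        cls = self.cls_token.expand(B, -1, -1)
        x = torch.cat([cls, x], dim=1)      # (B, N+1, D)
        return x + self.pos_embed           # broadcast over the batch

embed = PatchEmbedding()
tokens = embed(torch.randn(2, 3, 32, 32))
print(tokens.shape)   # torch.Size([2, 65, 128])
```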

Comparison to CNNs

The patch embedding exposes a real architectural divide:

CNNs impose strong inductive biases: locality (small filters see nearby pixels) and translation equivariance (the same filter applies everywhere). These assumptions hold for natural images and let CNNs learn efficiently from limited data. ViTs build in almost none of them: apart from cutting the image into patches, self-attention treats every pair of tokens the same way, so the model has to learn locality and spatial structure from data, which demands much more of it.
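Translation equivariance is easy to see in one dimension. The toy NumPy sketch below uses a circular (wrap-around) convolution so the property holds exactly; real CNNs with zero padding satisfy it only away from the borders. The helper `circular_conv1d` is written for this example.

```python
import numpy as np

def circular_conv1d(x, w):
    """Cross-correlate signal x with filter w, wrapping around at the edges."""
    n = len(x)
    return np.array([sum(w[k] * x[(i + k) % n] for k in range(len(w))) for i in range(n)])

rng = np.random.default_rng(0)
x = rng.standard_normal(16)
w = rng.standard_normal(3)     # small filter: each output sees only 3 neighbours (locality)

shift = 5
shift_then_conv = circular_conv1d(np.roll(x, shift), w)
conv_then_shift = np.roll(circular_conv1d(x, w), shift)
print(np.allclose(shift_then_conv, conv_then_shift))   # True
```

Shifting the input and then filtering gives the same result as filtering and then shifting. A ViT gets no such guarantee from its architecture and has to learn comparable behaviour from data.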

Next: From Patches to Attention

That covers the mathematical foundation: how an image becomes a sequence of patch tokens with positional awareness and a global aggregation mechanism (the CLS token).

In Part 2, we implement this pipeline in pure PyTorch, then build Multi-Head Self-Attention from scratch (no nn.MultiheadAttention), the Transformer encoder block with pre-norm residual connections, and the complete ViT architecture.