Deconstructing ViTs from Scratch

Part 1: The Math of Patch Embeddings

Introduction

In 2020, Dosovitskiy et al. posed a provocative question: can a pure Transformer, applied directly to sequences of image patches, match or surpass convolutional neural networks at image classification? The answer -- Vision Transformer (ViT) -- demonstrated that with sufficient data, the same architecture that powers GPT and BERT can achieve state-of-the-art results in computer vision.

The key insight is deceptively simple: an image is just a sequence. If we can convert pixels into tokens, we can apply the full machinery of self-attention without any convolution, pooling, or spatial inductive bias. In this first part of a three-part "Build in Public" series, we derive the mathematical foundations: how images become sequences, how positional information is encoded, and how a single learnable token aggregates global information.

Image as Sequence

A standard Transformer encoder expects a sequence of token embeddings $\{z_i\}_{i=1}^{N}$, each of dimension $D$. For NLP, each $z_i$ is a word embedding. For vision, we need to convert a 2D image into a 1D sequence.

Given an image $\mathbf{x} \in \mathbb{R}^{H \times W \times C}$, ViT divides it into a grid of non-overlapping patches:

$$ \mathbf{x}_p^{(i)} \in \mathbb{R}^{P \times P \times C}, \quad i = 1, 2, \ldots, N, \quad N = \frac{HW}{P^2} $$

where $P$ is the patch size. For a $32 \times 32$ CIFAR-10 image with $P = 4$, we get $N = (32/4)^2 = 64$ patches, each of size $4 \times 4 \times 3 = 48$ values.

Each patch is flattened into a vector and linearly projected to dimension $D$:

$$ \mathbf{z}_i^{(0)} = \mathbf{x}_p^{(i)} \mathbf{E}, \quad \mathbf{E} \in \mathbb{R}^{(P^2 C) \times D} $$
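To make the shapes concrete, here is a minimal sketch of the flatten-then-project step for the CIFAR-10 configuration above, with a randomly initialized $\mathbf{E}$ standing in for the learned projection:

```python
import torch

# CIFAR-10 settings from the text: 32x32 RGB image, patch size 4.
H = W = 32
C = 3
P = 4
D = 128
N = (H // P) ** 2  # number of patches: (32/4)^2 = 64

x = torch.randn(H, W, C)  # a single image, channels-last as in the math

# Split into non-overlapping P x P patches, then flatten each one
# into a vector of length P*P*C = 48.
patches = x.reshape(H // P, P, W // P, P, C).permute(0, 2, 1, 3, 4)
patches = patches.reshape(N, P * P * C)  # (64, 48)

# The projection E in R^{(P^2 C) x D}; learnable in the real model,
# randomly initialized here for illustration.
E = torch.randn(P * P * C, D)
z0 = patches @ E  # (64, 128): the initial patch embeddings z_i^{(0)}
```

Each row of `z0` is one token $\mathbf{z}_i^{(0)}$, ready for the Transformer.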

The Patch Embedding

While the mathematical formulation uses a flatten-then-project approach, the implementation uses a more elegant trick: a single Conv2d with kernel_size=P and stride=P.

This convolution extracts non-overlapping $P \times P$ patches (because stride equals kernel size) and projects them to $D$ channels in one operation. The output tensor of shape $(B, D, H/P, W/P)$ is then flattened spatially to produce the sequence $(B, N, D)$.

Mathematically, this is identical to the flatten-project formulation -- the Conv2d weights encode the same linear projection $\mathbf{E}$ -- but it is computationally more efficient and leverages optimized convolution kernels.
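The equivalence is easy to verify numerically. The sketch below compares the Conv2d form against an explicit flatten-then-project using `unfold`, reinterpreting the conv weights as $\mathbf{E}$ (variable names are our own):

```python
import torch
import torch.nn as nn

P, C, D = 4, 3, 128
B, H, W = 2, 32, 32

# Patch embedding as a strided convolution: kernel_size == stride == P.
proj = nn.Conv2d(C, D, kernel_size=P, stride=P)

x = torch.randn(B, C, H, W)
out = proj(x)                         # (B, D, H/P, W/P) = (2, 128, 8, 8)
seq = out.flatten(2).transpose(1, 2)  # (B, N, D) = (2, 64, 128)

# The same result via explicit flatten-then-project: unfold extracts the
# non-overlapping patches as columns, and the conv weight acts as E.
patches = nn.functional.unfold(x, kernel_size=P, stride=P)  # (B, C*P*P, N)
E = proj.weight.reshape(D, -1)        # (D, C*P*P), matching unfold's layout
seq2 = (E @ patches).transpose(1, 2) + proj.bias  # (B, N, D)

print(torch.allclose(seq, seq2, atol=1e-5))  # True: the two forms agree
```

Both paths produce the same $(B, N, D)$ sequence; the convolution is simply the faster route to it.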

The [CLS] Token

In BERT, a special [CLS] token is prepended to the input sequence, and its final representation serves as the aggregate sequence embedding for classification. ViT adopts the same strategy.

A learnable vector $\mathbf{z}_{\text{cls}} \in \mathbb{R}^D$ is prepended to the patch sequence:

$$ \mathbf{z}^{(0)} = [\mathbf{z}_{\text{cls}}; \; \mathbf{z}_1^{(0)}; \; \mathbf{z}_2^{(0)}; \; \ldots; \; \mathbf{z}_N^{(0)}] $$

The full sequence now has length $N + 1$. Through $L$ Transformer layers, the CLS token attends to all patch tokens and accumulates global information. After the final layer, only the CLS token output $\mathbf{z}_{\text{cls}}^{(L)}$ is passed to the classification head:

$$ \hat{y} = \text{MLP}_{\text{head}}(\text{LayerNorm}(\mathbf{z}_{\text{cls}}^{(L)})) $$

Why not just average all patch tokens? Both approaches work, but the CLS token has an appealing interpretation: it is a learnable "query" that asks "what class is this image?" and collects evidence from every patch through attention.
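In code, prepending the CLS token and reading it back out is a few lines. This is a sketch under our own naming, with an untrained `z0` standing in for the final-layer output $\mathbf{z}^{(L)}$:

```python
import torch
import torch.nn as nn

B, N, D = 2, 64, 128
num_classes = 10

# One learnable CLS vector, shared (expanded) across the batch.
cls_token = nn.Parameter(torch.zeros(1, 1, D))

patch_tokens = torch.randn(B, N, D)          # output of the patch embedding
cls = cls_token.expand(B, -1, -1)            # (B, 1, D)
z0 = torch.cat([cls, patch_tokens], dim=1)   # (B, N+1, D) = (2, 65, 128)

# After L Transformer layers, only position 0 (the CLS token) feeds the
# classification head; here z0 stands in for z^{(L)}.
head = nn.Sequential(nn.LayerNorm(D), nn.Linear(D, num_classes))
logits = head(z0[:, 0])                      # (B, num_classes)
```

Note that `z0[:, 0]` discards all $N$ patch tokens at the output; they matter only insofar as the CLS token attended to them.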

Positional Embeddings

Without convolutions, the Transformer has no notion of spatial order. Self-attention is permutation-equivariant by default: shuffling the patch tokens merely shuffles the corresponding outputs, so the model cannot tell one spatial arrangement of patches from another. To inject spatial awareness, we add learnable positional embeddings:

$$ \mathbf{z}^{(0)} = [\mathbf{z}_{\text{cls}} + \mathbf{p}_0; \; \mathbf{z}_1^{(0)} + \mathbf{p}_1; \; \ldots; \; \mathbf{z}_N^{(0)} + \mathbf{p}_N] $$

where $\mathbf{p}_i \in \mathbb{R}^D$ are learnable parameters. Unlike the fixed sinusoidal encodings from "Attention Is All You Need," ViT uses fully learnable 1D positional embeddings.
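The addition itself is a single broadcast over the batch. A minimal sketch, assuming one embedding row per position including the CLS slot at index 0:

```python
import torch
import torch.nn as nn

B, N, D = 2, 64, 128

# One learnable embedding per position (CLS slot included); the small
# 0.02 scale at init is a common choice, not mandated by the math.
pos_embed = nn.Parameter(torch.randn(1, N + 1, D) * 0.02)

tokens = torch.randn(B, N + 1, D)  # CLS + patch tokens
z0 = tokens + pos_embed            # broadcast add over the batch dim
```

Because the embeddings are ordinary parameters, gradients flow into them during training just as they do into the projection $\mathbf{E}$.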

What Do They Learn?

Dosovitskiy et al. showed that the learned positional embeddings exhibit a striking pattern: the cosine similarity between two position embeddings correlates with the 2D spatial distance between the corresponding patches. Nearby patches have similar embeddings; distant patches have dissimilar ones. The model discovers 2D spatial structure from 1D position indices -- without being told the image is 2D.

1D vs 2D Positional Embeddings

ViT uses 1D positional embeddings (a single index per patch) rather than 2D (row, column pairs). The authors found no significant benefit from 2D embeddings, suggesting that the model can infer the 2D grid structure from the 1D positions and the image content.

The Complete Embedding Pipeline

Putting it all together:

  1. Extract patches: Split image into $N = (H/P)^2$ non-overlapping patches using Conv2d
  2. Project: Each patch is linearly projected to dimension $D$
  3. Flatten: Spatial dimensions are collapsed to form a sequence of length $N$
  4. Prepend CLS: A learnable CLS token is prepended, making the sequence length $N+1$
  5. Add positions: Learnable positional embeddings are added element-wise

The output is a tensor of shape $(B, N+1, D)$ -- a batch of sequences ready for the Transformer encoder. For our CIFAR-10 configuration ($32 \times 32$ images, $P=4$, $D=128$), this is $(B, 65, 128)$: 65 tokens of 128 dimensions.
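The five steps above can be sketched as one module. This is an illustrative implementation under our own names (`ViTEmbedding` and its parameters are assumptions, not the paper's code):

```python
import torch
import torch.nn as nn

class ViTEmbedding(nn.Module):
    """Steps 1-5 of the embedding pipeline in one module (a sketch)."""

    def __init__(self, img_size=32, patch_size=4, in_chans=3, dim=128):
        super().__init__()
        n_patches = (img_size // patch_size) ** 2
        # Steps 1-2: patch extraction + projection as a strided Conv2d.
        self.proj = nn.Conv2d(in_chans, dim,
                              kernel_size=patch_size, stride=patch_size)
        # Step 4: learnable CLS token.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Step 5: learnable positional embeddings (CLS slot included).
        self.pos_embed = nn.Parameter(torch.randn(1, n_patches + 1, dim) * 0.02)

    def forward(self, x):                    # x: (B, C, H, W)
        z = self.proj(x)                     # (B, D, H/P, W/P)
        z = z.flatten(2).transpose(1, 2)     # Step 3: (B, N, D)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        z = torch.cat([cls, z], dim=1)       # (B, N+1, D)
        return z + self.pos_embed            # add positions

embed = ViTEmbedding()
out = embed(torch.randn(2, 3, 32, 32))
print(out.shape)  # torch.Size([2, 65, 128])
```

For the CIFAR-10 configuration this reproduces the $(B, 65, 128)$ tensor described above, ready for the encoder we build in Part 2.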

Comparison to CNNs

The patch embedding reveals a fundamental philosophical difference between ViTs and CNNs:

CNNs impose strong inductive biases: locality (small filters see nearby pixels) and translation equivariance (the same filter is applied everywhere). These are correct assumptions for images and allow CNNs to learn efficiently from limited data. ViTs make no such assumptions -- they must learn locality and spatial structure from scratch, which requires vastly more data.

Next Steps: From Patches to Attention

We have established the mathematical foundation: how an image becomes a sequence of patch tokens with positional awareness and a global aggregation mechanism (CLS token).

In Part 2, we implement this exact pipeline in pure PyTorch, then build the Multi-Head Self-Attention mechanism from scratch (no nn.MultiheadAttention), the Transformer encoder block with pre-norm residual connections, and the complete ViT architecture.

Stay tuned for the code drop as we build Vision Transformers from scratch!