
Deconstructing ViTs from Scratch

Part 3: Patches vs Pixels

Introduction

In Parts 1 and 2, we derived and implemented a Vision Transformer from scratch: patch embeddings via Conv2d, multi-head self-attention without nn.MultiheadAttention, pre-norm Transformer blocks, and the full ViT assembly. Now we confront the critical question: how does it actually perform?

We train our ViT-Tiny on CIFAR-10 alongside a simple CNN baseline, both on a restricted 5,000-image subset for 30 epochs. The results reveal a fundamental tension between architectural flexibility and data efficiency -- and confirm a central finding of the original ViT paper.

Experimental Setup

Models

| Model | Parameters | Inductive Bias |
| --- | --- | --- |
| ViT-Tiny | 1,205,898 | None (learned) |
| SimpleCNN | 141,354 | Locality, equivariance |
| SmallResNet | 175,258 | Locality, residual |

Training

Optimizer: AdamW (weight decay $= 0.01$). Learning rate: $10^{-3}$ with linear warmup (5 epochs for ViT, 3 for CNN) + cosine annealing. Loss: Cross-entropy with label smoothing ($\epsilon = 0.1$). 30 epochs on 5,000 training images, 1,000 test images.
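The schedule can be sketched as follows. This is a minimal sketch, not the series' exact training loop: the `Linear` model stands in for the ViT/CNN from Parts 1 and 2, and names like `WARMUP_EPOCHS` are ours.

```python
import math
import torch

# Stand-in model; the real ViT-Tiny / SimpleCNN come from Parts 1 and 2.
model = torch.nn.Linear(10, 10)
EPOCHS, WARMUP_EPOCHS, BASE_LR = 30, 5, 1e-3  # the CNNs use 3 warmup epochs

optimizer = torch.optim.AdamW(model.parameters(), lr=BASE_LR, weight_decay=0.01)
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)

def lr_lambda(epoch):
    """Linear warmup to the base LR, then cosine annealing down to zero."""
    if epoch < WARMUP_EPOCHS:
        return (epoch + 1) / WARMUP_EPOCHS
    progress = (epoch - WARMUP_EPOCHS) / (EPOCHS - WARMUP_EPOCHS)
    return 0.5 * (1 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```

Calling `scheduler.step()` once per epoch ramps the learning rate from $2 \times 10^{-4}$ up to $10^{-3}$ over the warmup, then decays it to zero by epoch 30.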

Results

| Model | Params | Best Test Acc | Final Train Acc |
| --- | --- | --- | --- |
| ViT-Tiny | 1,205,898 | 49.60% | 53.52% |
| SimpleCNN | 141,354 | 70.80% | 80.24% |
| SmallResNet | 175,258 | 63.30% | 69.06% |

The CNN outperforms the ViT by 21.2 percentage points despite having $8.5\times$ fewer parameters.

Training Dynamics

The training curves reveal a crucial distinction in failure modes: the CNN's failure mode on small data is overfitting (80% train vs. 70% test), while the ViT's is underfitting (54% train vs. 50% test) -- it cannot even fit the training data well, despite having $8.5\times$ more parameters.

Attention Map Analysis

We extract the attention weights from the last Transformer layer and examine which patches the CLS token attends to. Each of the 4 heads produces a $65 \times 65$ attention matrix; we take the CLS token's row (index 0) and the patch columns (indices 1-64), reshaping into an $8 \times 8$ spatial map.
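The indexing and reshaping step looks like this. A random softmax tensor stands in for the real attention weights here, which in our from-scratch model would be returned by (or hooked from) the last Transformer block:

```python
import torch

# Stand-in for the last layer's attention weights: (batch, heads, tokens, tokens).
# 65 tokens = 1 CLS token + 64 patches (an 8x8 grid over the 32x32 CIFAR-10 image).
batch, heads, tokens = 1, 4, 65
attn = torch.softmax(torch.randn(batch, heads, tokens, tokens), dim=-1)

cls_to_patches = attn[0, :, 0, 1:]               # CLS row, patch columns: (4, 64)
attn_maps = cls_to_patches.reshape(heads, 8, 8)  # one 8x8 spatial map per head
```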

Observations

Even with limited training, the attention heads show differentiated behavior: the per-head maps attend to different spatial regions rather than collapsing to a single shared pattern.

This head specialization is consistent with findings from NLP Transformers and suggests that the multi-head mechanism is serving its intended purpose: different heads capture different types of relationships between patches.

Positional Embedding Analysis

We compute the cosine similarity between all pairs of learned positional embeddings. If the model has learned meaningful spatial structure, nearby patches should have more similar positional embeddings than distant ones.
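A minimal sketch of the computation, assuming `pos_embed` is the learned $(1, 65, \text{dim})$ parameter from Part 2 (a random tensor stands in for it here, and the embedding dimension is assumed):

```python
import torch
import torch.nn.functional as F

dim = 192                                 # assumed ViT-Tiny embedding dim
pos_embed = torch.randn(1, 65, dim)       # stand-in for the learned parameter

patches = pos_embed[0, 1:]                # (64, dim): drop the CLS position
normed = F.normalize(patches, dim=-1)     # unit-norm rows
sim = normed @ normed.T                   # (64, 64) pairwise cosine similarities
```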

Similarity to Center Patch

For a more intuitive visualization, we compute the cosine similarity of every patch position to the center patch (index 32 in our $8 \times 8$ grid). The resulting $8 \times 8$ heatmap shows similarity falling off with spatial distance from the reference patch.

This confirms Dosovitskiy et al.'s finding: even 1D positional embeddings can capture 2D spatial structure. The model learns that patch 32 is spatially close to patches 24 and 40 -- its vertical neighbors, 8 positions apart in the flattened sequence -- even though nothing in the architecture encodes the 8-wide row structure.
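The center-patch heatmap is one row of the similarity matrix folded back into the grid. Again a random tensor stands in for the learned `pos_embed`, and the embedding dimension is an assumption:

```python
import torch
import torch.nn.functional as F

dim = 192                                        # assumed embedding dim
pos_embed = torch.randn(1, 65, dim)              # stand-in for the learned parameter
patches = F.normalize(pos_embed[0, 1:], dim=-1)  # (64, dim) unit-norm patch embeddings

ref = patches[32]                                # reference patch (index 32)
heatmap = (patches @ ref).reshape(8, 8)          # cosine similarity to every patch
```

The reference patch's own cell is exactly 1; for a trained model, its row and column neighbors light up next.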

Why ViTs Fail on Small Data

Our results cleanly reproduce the central finding of the ViT paper. The explanation involves the concept of inductive bias.

What CNNs Know a Priori

Convolutional networks encode two strong assumptions about images:

  1. Locality: A $3 \times 3$ filter only sees nearby pixels. Important features (edges, textures) are local.
  2. Translation equivariance: The same filter is applied at every spatial location. A cat in the top-left looks the same as a cat in the bottom-right.

These assumptions are correct for natural images and dramatically reduce the hypothesis space. A CNN doesn't need to learn that nearby pixels are related -- it knows this architecturally.
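Translation equivariance is easy to verify empirically. The sketch below shifts an input two pixels to the right and checks that a randomly initialized `Conv2d`'s output shifts with it, away from the padded borders (the sizes and seed are illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)

x = torch.randn(1, 1, 16, 16)
shifted = torch.roll(x, shifts=2, dims=-1)   # shift the input 2 pixels right

y = conv(x)
y_shifted = conv(shifted)

# Interior columns of conv(shifted) equal the shifted conv(x); only columns
# near the padded/wrapped borders differ.
equivariant = torch.allclose(
    torch.roll(y, 2, dims=-1)[..., 3:-3], y_shifted[..., 3:-3], atol=1e-6
)
```

No analogous guarantee holds for a Transformer block: shifting the input patches permutes the sequence, and the learned positional embeddings respond arbitrarily.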

What ViTs Must Learn from Data

A Vision Transformer has none of these built-in assumptions: every patch attends to every other patch from the first layer, and any notion of spatial proximity must be learned from data through the positional embeddings.

This flexibility is powerful with sufficient data (ViT was originally trained on JFT-300M, a dataset of 300 million images). But with only 5,000 images, the model cannot learn these spatial priors effectively.

The Bias-Variance Tradeoff

In classical statistical terms, the CNN's inductive biases trade variance for bias: its restricted hypothesis space generalizes well from few samples, while the ViT's unrestricted hypothesis space has high variance that only large datasets can tame.

This is why ViT-Large with JFT-300M pretraining surpasses the best CNNs, but our ViT-Tiny on 5,000 images loses by 21 points. The crossover point -- where ViTs begin to outperform CNNs -- occurs somewhere around ImageNet scale (~1M images) with appropriate regularization.

Bridging the Gap

Several subsequent works have addressed ViT's data efficiency -- most notably DeiT, which reaches competitive accuracy training on ImageNet-1k alone through strong augmentation and knowledge distillation from a CNN teacher.

Conclusion

Our from-scratch ViT implementation demonstrates both the elegance and the limitation of applying pure self-attention to images: the architecture is simple and general, but without the spatial priors a CNN gets for free, it needs far more data to compete.

The lesson is not that ViTs are inferior to CNNs -- they are not. The lesson is that inductive biases are priors, not limitations. When your prior matches the data distribution (as spatial locality matches natural images), it accelerates learning. When you have enough data to overcome the lack of priors, the more flexible architecture wins.

This is arguably the deepest insight of the ViT paper: the boundary between "architecture design" and "learning from data" is a spectrum, not a binary choice. Vision Transformers shifted that boundary decisively toward data-driven learning, and the field has never looked back.