Deconstructing ViTs from Scratch

Part 3: Patches vs Pixels

Introduction

Parts 1 and 2 covered the math and implementation: patch embeddings via Conv2d, manual multi-head self-attention, pre-norm Transformer blocks, and the full ViT assembly. Now the question: how does it actually perform?

We train our ViT-Tiny (1.2M params) on CIFAR-10 alongside a SimpleCNN (141K params) and our SmallResNet (175K params), all on a restricted 5,000-image subset for 30 epochs. The gap is large, and it lines up with the central finding of the original ViT paper.

Experimental Setup

Models

| Model | Parameters | Inductive Bias |
|---|---|---|
| ViT-Tiny | 1,205,898 | None (learned) |
| SimpleCNN | 141,354 | Locality, equivariance |
| SmallResNet | 175,258 | Locality, residual |

Training

  - Optimizer: AdamW (weight decay $= 0.01$).
  - Learning rate: $10^{-3}$ peak, with linear warmup (5 epochs for the ViT, 3 for the CNNs) followed by cosine annealing.
  - Loss: cross-entropy with label smoothing ($\epsilon = 0.1$).
  - Data: 30 epochs on 5,000 training images, evaluated on 1,000 test images.
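The warmup-plus-cosine schedule can be sketched in plain Python. This is a minimal sketch, not the training code used here: it assumes the rate is updated once per epoch, that warmup ramps linearly up to the peak, and that the cosine phase decays toward zero (`min_lr` is a hypothetical parameter).

```python
import math


def lr_at_epoch(epoch: int, base_lr: float = 1e-3, warmup_epochs: int = 5,
                total_epochs: int = 30, min_lr: float = 0.0) -> float:
    """Learning rate for a zero-based epoch: linear warmup, then cosine decay."""
    if epoch < warmup_epochs:
        # Linear ramp that reaches base_lr on the last warmup epoch.
        return base_lr * (epoch + 1) / warmup_epochs
    # Fraction of the post-warmup phase completed, in [0, 1).
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Assumed per-model settings: 5 warmup epochs for the ViT, 3 for the CNNs.
vit_schedule = [lr_at_epoch(e, warmup_epochs=5) for e in range(30)]
cnn_schedule = [lr_at_epoch(e, warmup_epochs=3) for e in range(30)]
```

In PyTorch, a comparable schedule could probably be assembled from `LinearLR`, `CosineAnnealingLR`, and `SequentialLR` in `torch.optim.lr_scheduler`, stepping once per epoch or once per batch.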

Results

| Model | Params | Best Test Acc | Final Train Acc |
|---|---|---|---|
| ViT-Tiny | 1,205,898 | 49.60% | 53.52% |
| SimpleCNN | 141,354 | 70.80% | 80.24% |
| SmallResNet | 175,258 | 63.30% | 69.06% |

The SimpleCNN beats the ViT by 21.2 percentage points with $8.5\times$ fewer parameters. The SmallResNet lands at 63.30%, reaching that peak at epoch 28 (train loss 0.8683, test loss 1.0764).

Training Dynamics

The failure modes differ. The CNN overfits (80% train vs 71% test). The ViT underfits (54% train vs 50% test): it cannot even fit the training set, despite having $8.5\times$ as many parameters as the SimpleCNN.
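The headline numbers can be rechecked directly from the results table. The snippet below only redoes the arithmetic. Because the table pairs best test accuracy with final train accuracy, the train-minus-test gap it prints is an approximation, not a same-epoch comparison.

```python
results = {
    # model: (parameters, best test accuracy %, final train accuracy %)
    "ViT-Tiny": (1_205_898, 49.60, 53.52),
    "SimpleCNN": (141_354, 70.80, 80.24),
    "SmallResNet": (175_258, 63.30, 69.06),
}

vit_params, vit_test, _ = results["ViT-Tiny"]
cnn_params, cnn_test, _ = results["SimpleCNN"]

test_gap = cnn_test - vit_test         # percentage points
param_ratio = vit_params / cnn_params  # ViT size relative to SimpleCNN

# Train-minus-test gap per model: large for overfitting, small for underfitting.
train_test_gap = {name: train - test
                  for name, (_, test, train) in results.items()}

print(f"SimpleCNN - ViT-Tiny: {test_gap:.1f} points")
print(f"parameter ratio: {param_ratio:.2f}x")
for name, gap in train_test_gap.items():
    print(f"{name}: train - test = {gap:.2f} points")
```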

Attention Map Analysis

We extract attention weights from the last Transformer layer and look at which patches the CLS token attends to. Each of the 4 heads produces a $65 \times 65$ attention matrix; we take the CLS row (index 0) over the patch columns (indices 1-64) and reshape into an $8 \times 8$ spatial map.
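The slicing and reshaping step can be sketched as follows. This is an illustration on synthetic weights, written with nested lists so it runs without dependencies. The function name `cls_attention_maps` is hypothetical, and the row-major patch order is an assumption (it is what a Conv2d patch embedding followed by a flatten would normally produce).

```python
import random


def cls_attention_maps(attn, grid=8):
    """Turn per-head attention matrices into grid x grid CLS-to-patch maps.

    attn: nested lists shaped [heads][tokens][tokens], tokens = 1 + grid * grid,
          with the CLS token at index 0.
    """
    maps = []
    for head in attn:
        cls_to_patches = head[0][1:]  # CLS row, dropping the CLS-to-CLS entry
        assert len(cls_to_patches) == grid * grid
        # Row-major reshape: patch i goes to (i // grid, i % grid).
        maps.append([cls_to_patches[r * grid:(r + 1) * grid]
                     for r in range(grid)])
    return maps


# Synthetic stand-in for the last layer: 4 heads, 65 x 65, each row sums to 1.
random.seed(0)
fake_attn = []
for _ in range(4):
    head = []
    for _ in range(65):
        row = [random.random() for _ in range(65)]
        total = sum(row)
        head.append([v / total for v in row])
    fake_attn.append(head)

maps = cls_attention_maps(fake_attn)
```

With a tensor of shape `(heads, 65, 65)`, the same operation is the one-liner `attn[:, 0, 1:].reshape(-1, 8, 8)`.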

Observations

Even after only 30 epochs on 5,000 images, the attention heads differentiate:

This specialization across heads matches what has been observed in NLP Transformers. Different heads capture different types of patch relationships.

Positional Embedding Analysis

We compute cosine similarity between all pairs of learned positional embeddings. If the model has picked up spatial structure, nearby patches should have more similar embeddings than distant ones.
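A minimal sketch of the pairwise computation is shown below on toy vectors. It assumes the CLS position has already been dropped, so that each row corresponds to one patch. The helper name `cosine_similarity_matrix` is hypothetical.

```python
import math


def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between the rows of a list of vectors."""
    norms = [math.sqrt(sum(x * x for x in vec)) for vec in embeddings]
    n = len(embeddings)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(embeddings[i], embeddings[j]))
            sim[i][j] = dot / (norms[i] * norms[j])
    return sim


# Toy 2-D vectors standing in for the learned (num_patches, dim) table.
toy = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0], [-1.0, 0.0]]
sim = cosine_similarity_matrix(toy)
```

For the real model, the input would be the 64 patch rows of the positional embedding table, which gives a $64 \times 64$ similarity matrix. Each row of that matrix reshapes into one $8 \times 8$ heatmap.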

Similarity to Center Patch

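Computing cosine similarity of every position to a reference patch (index 32, the midpoint of the 64 flattened positions) produces an $8 \times 8$ heatmap showing highest similarity near the reference and decreasing similarity with distance. Under row-major, zero-based numbering, index 32 sits at row 4, column 0 rather than at the spatial center; the four central patches are 27, 28, 35, and 36. Either way, the model has recovered 2D spatial locality from 1D positional indices.

This matches Dosovitskiy et al.'s finding. The model learns that a patch at flat index $i$ is close to patches $i - 1$ and $i + 1$ (horizontal neighbors within the same row) and patches $i - 8$ and $i + 8$ (vertical neighbors, 8 apart in the flattened sequence), all without being told the image is 2D. For central patch 36, those are 35 and 37, and 28 and 44.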
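The relationship between flat indices and grid positions is easy to get wrong, so here is a small sketch of the mapping. It assumes row-major, zero-based patch numbering, which is a common convention but is not confirmed by the text. The helper names are hypothetical.

```python
GRID = 8  # 8 x 8 patch grid, as in the text


def to_row_col(index, grid=GRID):
    """Flat patch index -> (row, col), assuming row-major, zero-based order."""
    return divmod(index, grid)


def grid_neighbors(index, grid=GRID):
    """Flat indices of the up/down/left/right neighbors that exist on the grid."""
    row, col = to_row_col(index, grid)
    candidates = [(row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)]
    return sorted(r * grid + c for r, c in candidates
                  if 0 <= r < grid and 0 <= c < grid)


print(to_row_col(36), grid_neighbors(36))  # a central patch
print(to_row_col(32), grid_neighbors(32))  # midpoint of the flattened sequence
```

Under this convention a step of 1 in the flat index is a horizontal move only within a row, while a step of 8 is always a vertical move. That is the structure the positional embeddings have to discover.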

Why ViTs Fail on Small Data

Our results are consistent with the central claim of the ViT paper: without large-scale pre-training, ViTs trail comparable CNNs. It comes down to inductive bias.

What CNNs Know a Priori

Convolutional networks bake in two assumptions about images:

  1. Locality: A $3 \times 3$ filter only sees nearby pixels. Edges and textures are local features.
  2. Translation equivariance: The same filter applies at every spatial position. A cat in the top-left uses the same detector as a cat in the bottom-right.

These are correct for natural images and they shrink the hypothesis space. A CNN does not need to learn that nearby pixels are related -- the architecture enforces it.
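Both properties can be checked on a toy example. The sketch below uses a 1-D signal and wrap-around padding, which are simplifications chosen so that the equivariance holds exactly. With zero padding on a 2-D image, the same property holds away from the borders.

```python
def circular_correlate(signal, kernel):
    """1-D cross-correlation with wrap-around padding and one shared kernel."""
    n, k = len(signal), len(kernel)
    half = k // 2
    return [sum(kernel[j] * signal[(i + j - half) % n] for j in range(k))
            for i in range(n)]


def shift(signal, steps):
    """Circularly shift a list to the right by `steps` positions."""
    steps %= len(signal)
    return signal[-steps:] + signal[:-steps]


signal = [0, 0, 1, 3, 1, 0, 0, 0]
edge_kernel = [-1, 0, 1]  # width-3 filter: each output sees 3 adjacent inputs

# Translation equivariance: filtering a shifted input equals shifting the output.
filter_then_shift = shift(circular_correlate(signal, edge_kernel), 2)
shift_then_filter = circular_correlate(shift(signal, 2), edge_kernel)
print(filter_then_shift == shift_then_filter)

# Locality: changing a distant input leaves the output at position 3 unchanged.
perturbed = list(signal)
perturbed[6] = 100
print(circular_correlate(perturbed, edge_kernel)[3]
      == circular_correlate(signal, edge_kernel)[3])
```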

What ViTs Must Learn from Data

A Vision Transformer starts with none of these built-in assumptions:

This flexibility pays off with enough data (the original ViT was pre-trained on JFT-300M, roughly 300 million images). With 5,000 images, the model cannot learn these spatial priors well enough to match a CNN that has them built in.
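One concrete way to see the missing spatial prior: without positional embeddings, self-attention treats the patches as an unordered set, so shuffling the input patches just shuffles the outputs. The toy sketch below uses a single head with identity query, key, and value projections. Both are assumptions made to keep it short; the property also holds with learned projections, since they act on each token independently.

```python
import math


def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections (toy)."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in tokens]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * v[d] for w, v in zip(weights, tokens))
                    for d in range(dim)])
    return out


patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0]]
perm = [2, 0, 3, 1]

# Attend over shuffled patches vs. shuffle the outputs of the original order.
attend_shuffled = self_attention([patches[i] for i in perm])
original_out = self_attention(patches)
shuffle_attended = [original_out[i] for i in perm]

same = all(math.isclose(a, b, abs_tol=1e-9)
           for row_a, row_b in zip(attend_shuffled, shuffle_attended)
           for a, b in zip(row_a, row_b))
print(same)
```

Any notion of "which patch is next to which" therefore has to come from the positional embeddings, which start out random and are learned from data.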

The Bias-Variance Tradeoff

In classical terms:

This is why ViT-Large pre-trained on JFT-300M surpasses the best CNNs, while our ViT-Tiny on 5,000 images loses by 21 points. The crossover, where ViTs start outperforming CNNs, sits around ImageNet scale (~1.3M images), and only with strong regularization.

Bridging the Gap

Several lines of work have tackled ViT's data efficiency problem:

Conclusion

What this experiment shows:

ViTs are not worse than CNNs. They just need more data. Inductive biases are priors, not limitations -- when the prior fits the data (spatial locality for natural images), it speeds up learning. When you have enough data to compensate for the lack of priors, the more flexible architecture wins.