Deconstructing ViTs: Part 3 - Patches vs Pixels

Introduction

Parts 1 and 2 covered the math and implementation: patch embeddings via Conv2d, manual multi-head self-attention, pre-norm Transformer blocks, and the full ViT assembly. Now the question: how does it actually perform?

We train our ViT-Tiny (1.2M params) on CIFAR-10 alongside a SimpleCNN (141K params) and our SmallResNet (175K params), all on a restricted 5,000-image subset for 30 epochs. The gap is large, and it lines up with the central finding of the original ViT paper.

Experimental Setup

Models

Model	Parameters	Inductive Bias
ViT-Tiny	1,205,898	None (learned)
SimpleCNN	141,354	Locality, equivariance
SmallResNet	175,258	Locality, residual

Training

Optimizer: AdamW (weight decay $= 0.01$)
Learning rate: $10^{-3}$ with linear warmup (5 epochs for ViT, 3 for CNN) followed by cosine annealing
Loss: Cross-entropy with label smoothing ($\epsilon = 0.1$)
Data: 30 epochs on 5,000 training images, 1,000 test images
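To make the schedule concrete, here is a minimal sketch of linear warmup followed by cosine annealing. The function name `lr_at_epoch`, the per-epoch (rather than per-step) update, and the floor `min_lr=0.0` are assumptions for illustration; the actual training script may differ in those details.

```python
import math

def lr_at_epoch(epoch, total_epochs=30, warmup_epochs=5, base_lr=1e-3, min_lr=0.0):
    """Learning rate for a 0-indexed epoch: linear warmup, then cosine decay.

    Assumed form: the rate is updated once per epoch and decays toward min_lr.
    """
    if epoch < warmup_epochs:
        # Linear ramp that reaches base_lr on the last warmup epoch.
        return base_lr * (epoch + 1) / warmup_epochs
    # Fraction of the post-warmup phase already completed, in [0, 1).
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# 5 warmup epochs for the ViT, 3 for the CNN, as in the setup above.
vit_schedule = [lr_at_epoch(e, warmup_epochs=5) for e in range(30)]
cnn_schedule = [lr_at_epoch(e, warmup_epochs=3) for e in range(30)]
```

Under these assumptions the ViT spends its first five epochs below the peak rate, which is consistent with its slow start in the training curves.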

Results

Model	Params	Best Test Acc	Final Train Acc
ViT-Tiny	1,205,898	49.60%	53.52%
SimpleCNN	141,354	70.80%	80.24%
SmallResNet	175,258	63.30%	69.06%

The SimpleCNN beats the ViT by 21.2 percentage points with $8.5\times$ fewer parameters. The SmallResNet lands at 63.30%, reaching that peak at epoch 28 (train loss 0.8683, test loss 1.0764).
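The two headline numbers follow directly from the tables above. A short check, using only the reported parameter counts and best test accuracies (the variable names are illustrative):

```python
# Values copied from the Models and Results tables.
params = {"ViT-Tiny": 1_205_898, "SimpleCNN": 141_354, "SmallResNet": 175_258}
best_test_acc = {"ViT-Tiny": 49.60, "SimpleCNN": 70.80, "SmallResNet": 63.30}

# Accuracy gap in percentage points and parameter ratio, CNN vs ViT.
gap_cnn_vs_vit = best_test_acc["SimpleCNN"] - best_test_acc["ViT-Tiny"]
param_ratio = params["ViT-Tiny"] / params["SimpleCNN"]

print(f"gap: {gap_cnn_vs_vit:.1f} points, parameter ratio: {param_ratio:.2f}x")
```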

Training Dynamics

CNN: Fast early progress -- 50% train accuracy by epoch 5, then a steady climb. The growing train-test gap (80% vs 71%) signals mild overfitting on the small dataset.
ViT: Barely above random at epoch 1 (~16%) during warmup. Gradual improvement through 30 epochs but never catches the CNN. The train-test gap stays small -- the model is underfitting, not overfitting.
SmallResNet: Starts slow (19.64% at epoch 1), crosses 50% train accuracy around epoch 10, and plateaus near 69% train / 63% test by epoch 27-28.

The failure modes are different. The CNN overfits (80% train vs 71% test). The ViT underfits (54% train vs 50% test) -- it cannot even fit the 5,000-image training set within 30 epochs, despite having $8.5\times$ as many parameters as the SimpleCNN.
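One rough way to put numbers on the two failure modes is the train-test gap per model. The sketch below uses the final train accuracy and best test accuracy from the results table; since those two values may come from different epochs, treat the gaps as approximate.

```python
# (final train accuracy, best test accuracy) in percent, from the results table.
results = {
    "ViT-Tiny": (53.52, 49.60),
    "SimpleCNN": (80.24, 70.80),
    "SmallResNet": (69.06, 63.30),
}

# Train-test gap in percentage points for each model.
train_test_gap = {name: train - test for name, (train, test) in results.items()}

for name, gap in train_test_gap.items():
    train_acc = results[name][0]
    print(f"{name}: train {train_acc:.2f}%, gap {gap:.2f} points")
```

A small gap combined with low train accuracy (the ViT) points to underfitting; a larger gap with high train accuracy (the SimpleCNN) points to overfitting.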

Attention Map Analysis

We extract attention weights from the last Transformer layer and look at which patches the CLS token attends to. Each of the 4 heads produces a $65 \times 65$ attention matrix; we take the CLS row (index 0) over the patch columns (indices 1-64) and reshape into an $8 \times 8$ spatial map.
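The slicing step can be sketched in NumPy as follows. The helper name `cls_attention_maps` is hypothetical, and the random softmax weights are a stand-in for the real last-layer attention tensor, which would first be moved off the model and converted to an array of shape (heads, 65, 65).

```python
import numpy as np

def cls_attention_maps(attn, grid=8):
    """Turn attention weights of shape (heads, 1 + grid*grid, 1 + grid*grid)
    into per-head CLS-to-patch maps of shape (heads, grid, grid)."""
    num_heads, n_query, n_key = attn.shape
    assert n_query == n_key == 1 + grid * grid
    # Row 0 is the CLS query; columns 1.. are the patch keys (column 0 is CLS itself).
    cls_to_patches = attn[:, 0, 1:]
    return cls_to_patches.reshape(num_heads, grid, grid)

# Stand-in data: random logits pushed through a row-wise softmax.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 65, 65))
attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)

maps = cls_attention_maps(attn)
print(maps.shape)
```

Because the CLS-to-CLS weight is dropped, each map sums to slightly less than 1; renormalizing before plotting is optional.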

Observations

Even after only 30 epochs on 5,000 images, the attention heads differentiate:

Some heads spread attention broadly, aggregating global context
Others focus on specific spatial regions
The distributions are non-uniform -- the model has learned that some patches matter more than others for a given input

This specialization across heads matches what has been observed in NLP Transformers. Different heads capture different types of patch relationships.
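"Broad" versus "focused" can be made measurable with the entropy of each head's CLS-to-patch distribution. This is one possible diagnostic, not something computed in the experiment above; `normalized_entropy` is a hypothetical helper, and the two example distributions are synthetic.

```python
import numpy as np

def normalized_entropy(p, eps=1e-12):
    """Shannon entropy of each row of p, scaled to [0, 1] by log(num_patches).

    Values near 1 mean attention is spread almost uniformly; values near 0
    mean it is concentrated on a few patches.
    """
    p = p / p.sum(axis=-1, keepdims=True)  # renormalize over patches only
    entropy = -(p * np.log(p + eps)).sum(axis=-1)
    return entropy / np.log(p.shape[-1])

# Synthetic examples: a perfectly uniform head and a sharply peaked head.
uniform_head = np.full((1, 64), 1.0 / 64)
peaked_head = np.full((1, 64), 1e-6)
peaked_head[0, 10] = 1.0

broad_score = normalized_entropy(uniform_head)[0]
focused_score = normalized_entropy(peaked_head)[0]
```

Applied to real attention maps, the same function would take the flattened (heads, 64) CLS rows and return one score per head.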

Positional Embedding Analysis

We compute cosine similarity between all pairs of learned positional embeddings. If the model has picked up spatial structure, nearby patches should have more similar embeddings than distant ones.
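A minimal sketch of the pairwise computation, assuming the patch positional embeddings have been pulled out as a (64, dim) array with the CLS position removed. The embedding width of 128 and the random values are placeholders, not the model's actual configuration.

```python
import numpy as np

def cosine_similarity_matrix(pos_embed):
    """Pairwise cosine similarity between rows of a (num_positions, dim) array."""
    norms = np.linalg.norm(pos_embed, axis=1, keepdims=True)
    unit = pos_embed / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T

# Placeholder embeddings; a trained model's table would be used instead.
rng = np.random.default_rng(0)
pos_embed = rng.normal(size=(64, 128))

sim = cosine_similarity_matrix(pos_embed)

# Row i, reshaped to the 8x8 grid, is the similarity heatmap for patch i.
heatmap_for_patch_0 = sim[0].reshape(8, 8)
```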

Similarity to Center Patch

Computing cosine similarity of every position to a center patch produces an $8 \times 8$ heatmap showing highest similarity near the center and decreasing similarity with distance. With row-major flattening of the $8 \times 8$ grid, a central patch is index 36 (row 4, column 4); index 32 would be row 4, column 0, on the left edge. The model has recovered 2D spatial locality from 1D positional indices.

This matches Dosovitskiy et al.'s finding. The model learns that patch 36 is close to patches 35 and 37 (horizontal neighbors) and patches 28 and 44 (vertical neighbors, 8 apart in the flattened sequence) -- all without being told the image is 2D.
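The index arithmetic is easy to get wrong, so here is a small helper for checking which flattened indices are spatial neighbors. It assumes row-major flattening and patch indices 0-63 with the CLS token excluded; both function names are hypothetical.

```python
def to_row_col(index, grid=8):
    """Map a flattened patch index to (row, col) under row-major order."""
    return divmod(index, grid)

def grid_neighbors(index, grid=8):
    """Flattened indices of the up/down/left/right neighbors that lie inside the grid."""
    row, col = divmod(index, grid)
    candidates = {
        "up": (row - 1, col),
        "down": (row + 1, col),
        "left": (row, col - 1),
        "right": (row, col + 1),
    }
    return {
        name: r * grid + c
        for name, (r, c) in candidates.items()
        if 0 <= r < grid and 0 <= c < grid
    }

center = grid_neighbors(36)     # interior patch at row 4, column 4
left_edge = grid_neighbors(32)  # row 4, column 0: no left neighbor
```

Vertical neighbors always differ by 8 in the flattened sequence, while consecutive indices are horizontal neighbors only when they sit in the same row.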

Why ViTs Fail on Small Data

Our results are consistent with the central claim of the ViT paper: without large-scale pretraining, a Transformer lags a comparable CNN. It comes down to inductive bias.

What CNNs Know a Priori

Convolutional networks bake in two assumptions about images:

Locality: A $3 \times 3$ filter only sees nearby pixels. Edges and textures are local features.
Translation equivariance: The same filter applies at every spatial position. A cat in the top-left uses the same detector as a cat in the bottom-right.

These are correct for natural images and they shrink the hypothesis space. A CNN does not need to learn that nearby pixels are related -- the architecture enforces it.
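Translation equivariance can be verified numerically: shifting the input and then filtering gives the same result as filtering and then shifting. The sketch below uses a 1D signal with wrap-around boundaries so the equality is exact; that boundary choice is an assumption for the demo, and a zero-padded Conv2d is equivariant only away from the image borders.

```python
import numpy as np

def circular_correlate(x, kernel):
    """1D cross-correlation with wrap-around boundaries (same length as x)."""
    n, k = len(x), len(kernel)
    return np.array(
        [sum(kernel[j] * x[(i + j) % n] for j in range(k)) for i in range(n)]
    )

rng = np.random.default_rng(0)
x = rng.normal(size=16)
kernel = np.array([1.0, 0.0, -1.0])  # a simple edge-like filter
shift = 5

# Shift the input first, then apply the filter...
shift_then_filter = circular_correlate(np.roll(x, shift), kernel)
# ...versus apply the filter first, then shift the output.
filter_then_shift = np.roll(circular_correlate(x, kernel), shift)

print(np.allclose(shift_then_filter, filter_then_shift))
```

The same three weights are reused at all 16 positions, which is the weight sharing that keeps the CNN's parameter count low.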

What ViTs Must Learn from Data

A Vision Transformer starts with almost none of these built-in assumptions; beyond the initial split into patches, its attention layers are global:

Every patch can attend to every other patch from layer 1 -- no locality constraint
Positional embeddings are initialized randomly -- spatial structure is not given
Attention patterns are input-dependent -- the model must figure out when to attend locally vs globally

This flexibility pays off with enough data (the original ViT trained on JFT-300M, 300 million images). With 5,000 images, the model does not see enough examples to learn these spatial priors well enough to compete.
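The absence of a spatial prior can be shown directly: without positional embeddings, self-attention is permutation-equivariant, so shuffling the patches only shuffles the outputs. The following is a single-head NumPy sketch with random weights and a placeholder width of 16; it is not the multi-head module from Part 2.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention, no positional information."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
tokens = rng.normal(size=(64, 16))  # 64 patch tokens, placeholder width 16
w_q, w_k, w_v = (rng.normal(size=(16, 16)) for _ in range(3))
perm = rng.permutation(64)          # a random shuffle of the patch order

out = self_attention(tokens, w_q, w_k, w_v)
out_shuffled = self_attention(tokens[perm], w_q, w_k, w_v)

# Shuffling the inputs just shuffles the outputs in the same way.
print(np.allclose(out_shuffled, out[perm]))
```

Everything the model knows about where a patch sits therefore has to come from the learned positional embeddings, which start out random.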

The Bias-Variance Tradeoff

In classical terms:

CNNs: High bias (strong assumptions), low variance $\rightarrow$ good on small data
ViTs: Low bias (few assumptions), high variance $\rightarrow$ need large data to generalize

This is why ViT-Large with JFT-300M pretraining surpassed the strongest CNN baselines of its time, while our ViT-Tiny on 5,000 images loses by 21 points. The crossover -- where ViTs start matching CNNs -- sits around ImageNet-1K scale (~1.3M images), and only with strong augmentation and regularization; without them, the original paper needed far larger pretraining sets.

Bridging the Gap

Several lines of work have tackled ViT's data efficiency problem:

DeiT (Touvron et al., 2021): A training recipe with heavy augmentation, knowledge distillation, and regularization lets ViT match CNNs using only ImageNet-1K
Hybrid models: CCT, CvT, and CoAtNet add convolutional tokenizers, projections, or stages to inject spatial priors
Masked pretraining: MAE (He et al., 2022) uses masked image modeling as a self-supervised pretext task for data-efficient ViT training

Conclusion

What this experiment shows:

The ViT architecture is simple -- patch embed, stack Transformer blocks, classify the CLS token -- but it is not a free lunch
Attention heads specialize spatially even with limited training data
1D positional embeddings learn 2D structure
On 5,000 images, data efficiency is the bottleneck: 49.60% (ViT) vs 70.80% (CNN) vs 63.30% (ResNet)

ViTs are not worse than CNNs. They just need more data. Inductive biases are priors, not limitations -- when the prior fits the data (spatial locality for natural images), it speeds up learning. When you have enough data to compensate for the lack of priors, the more flexible architecture wins.
