
Deconstructing ViTs from Scratch

Part 3: Patches vs Pixels

Introduction

In Parts 1 and 2, we derived and implemented a Vision Transformer from scratch: patch embeddings via Conv2d, multi-head self-attention without nn.MultiheadAttention, pre-norm Transformer blocks, and the full ViT assembly. Now we confront the critical question: how does it actually perform?

We train our ViT-Tiny on CIFAR-10 alongside a simple CNN baseline, both on a restricted 5,000-image subset for 30 epochs. The results reveal a fundamental tension between architectural flexibility and data efficiency -- and confirm a central finding of the original ViT paper.

Experimental Setup

Models

| Model | Parameters | Inductive Bias |
| --- | --- | --- |
| ViT-Tiny | 1,205,898 | None (learned) |
| SimpleCNN | 141,354 | Locality, equivariance |
| SmallResNet | 175,258 | Locality, residual |

Training

Optimizer: AdamW (weight decay $= 0.01$). Learning rate: $10^{-3}$ with linear warmup (5 epochs for ViT, 3 for CNN) + cosine annealing. Loss: Cross-entropy with label smoothing ($\epsilon = 0.1$). 30 epochs on 5,000 training images, 1,000 test images.
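The schedule can be sketched as follows. This is a minimal sketch, not the series' exact training loop: the `Linear` model stands in for the ViT/CNN from Parts 1 and 2, and names like `WARMUP_EPOCHS` are ours.

```python
import math
import torch

# Stand-in model; the real ViT-Tiny / SimpleCNN come from Parts 1 and 2.
model = torch.nn.Linear(10, 10)
EPOCHS, WARMUP_EPOCHS, BASE_LR = 30, 5, 1e-3  # the CNNs use 3 warmup epochs

optimizer = torch.optim.AdamW(model.parameters(), lr=BASE_LR, weight_decay=0.01)
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)

def lr_lambda(epoch):
    """Linear warmup to the base LR, then cosine annealing down to zero."""
    if epoch < WARMUP_EPOCHS:
        return (epoch + 1) / WARMUP_EPOCHS
    progress = (epoch - WARMUP_EPOCHS) / (EPOCHS - WARMUP_EPOCHS)
    return 0.5 * (1 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```

Calling `scheduler.step()` once per epoch ramps the learning rate from $2 \times 10^{-4}$ up to $10^{-3}$ over the warmup, then decays it to zero by epoch 30.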

Results

| Model | Params | Best Test Acc | Final Train Acc |
| --- | --- | --- | --- |
| ViT-Tiny | 1,205,898 | 49.60% | 53.52% |
| SimpleCNN | 141,354 | 70.80% | 80.24% |
| SmallResNet | 175,258 | 63.30% | 69.06% |

The CNN outperforms the ViT by 21.2 percentage points despite having $8.5\times$ fewer parameters.

Training Dynamics

The training curves reveal a crucial distinction in failure modes: the CNN's failure mode on small data is overfitting (80% train vs. 70% test), while the ViT's is underfitting (54% train vs. 50% test) -- it cannot even fit the training data well, despite having $8.5\times$ more parameters.

Attention Map Analysis

We extract the attention weights from the last Transformer layer and examine which patches the CLS token attends to. Each of the 4 heads produces a $65 \times 65$ attention matrix; we take the CLS token's row (index 0) and the patch columns (indices 1-64), reshaping into an $8 \times 8$ spatial map.
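The indexing and reshaping step looks like this. A random softmax tensor stands in for the real attention weights here, which in our from-scratch model would be returned by (or hooked from) the last Transformer block:

```python
import torch

# Stand-in for the last layer's attention weights: (batch, heads, tokens, tokens).
# 65 tokens = 1 CLS token + 64 patches (an 8x8 grid over the 32x32 CIFAR-10 image).
batch, heads, tokens = 1, 4, 65
attn = torch.softmax(torch.randn(batch, heads, tokens, tokens), dim=-1)

cls_to_patches = attn[0, :, 0, 1:]               # CLS row, patch columns: (4, 64)
attn_maps = cls_to_patches.reshape(heads, 8, 8)  # one 8x8 spatial map per head
```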

Observations

Even with limited training, the attention heads show differentiated behavior: the per-head maps attend to different spatial regions rather than collapsing to a single shared pattern.

This head specialization is consistent with findings from NLP Transformers and suggests that the multi-head mechanism is serving its intended purpose: different heads capture different types of relationships between patches.

Positional Embedding Analysis

We compute the cosine similarity between all pairs of learned positional embeddings. If the model has learned meaningful spatial structure, nearby patches should have more similar positional embeddings than distant ones.
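A minimal sketch of the computation, assuming `pos_embed` is the learned $(1, 65, \text{dim})$ parameter from Part 2 (a random tensor stands in for it here, and the embedding dimension is assumed):

```python
import torch
import torch.nn.functional as F

dim = 192                                 # assumed ViT-Tiny embedding dim
pos_embed = torch.randn(1, 65, dim)       # stand-in for the learned parameter

patches = pos_embed[0, 1:]                # (64, dim): drop the CLS position
normed = F.normalize(patches, dim=-1)     # unit-norm rows
sim = normed @ normed.T                   # (64, 64) pairwise cosine similarities
```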

Similarity to Center Patch

For a more intuitive visualization, we compute the cosine similarity of every patch position to the center patch (index 32 in our $8 \times 8$ grid). The resulting $8 \times 8$ heatmap shows similarity falling off with spatial distance from the reference patch.

This confirms Dosovitskiy et al.'s finding: even 1D positional embeddings can capture 2D spatial structure. The model learns that patch 32 is spatially close to patches 24 and 40 -- its vertical neighbors, 8 positions apart in the flattened sequence -- even though nothing in the architecture encodes the 8-wide row structure.
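The center-patch heatmap is one row of the similarity matrix folded back into the grid. Again a random tensor stands in for the learned `pos_embed`, and the embedding dimension is an assumption:

```python
import torch
import torch.nn.functional as F

dim = 192                                        # assumed embedding dim
pos_embed = torch.randn(1, 65, dim)              # stand-in for the learned parameter
patches = F.normalize(pos_embed[0, 1:], dim=-1)  # (64, dim) unit-norm patch embeddings

ref = patches[32]                                # reference patch (index 32)
heatmap = (patches @ ref).reshape(8, 8)          # cosine similarity to every patch
```

The reference patch's own cell is exactly 1; for a trained model, its row and column neighbors light up next.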

Why ViTs Fail on Small Data

Our results cleanly reproduce the central finding of the ViT paper. The explanation involves the concept of inductive bias.

What CNNs Know a Priori

Convolutional networks encode two strong assumptions about images:

  1. Locality: A $3 \times 3$ filter only sees nearby pixels. Important features (edges, textures) are local.
  2. Translation equivariance: The same filter is applied at every spatial location. A cat in the top-left looks the same as a cat in the bottom-right.

These assumptions are correct for natural images and dramatically reduce the hypothesis space. A CNN doesn't need to learn that nearby pixels are related -- it knows this architecturally.
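Translation equivariance is easy to verify empirically. The sketch below shifts an input two pixels to the right and checks that a randomly initialized `Conv2d`'s output shifts with it, away from the padded borders (the sizes and seed are illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)

x = torch.randn(1, 1, 16, 16)
shifted = torch.roll(x, shifts=2, dims=-1)   # shift the input 2 pixels right

y = conv(x)
y_shifted = conv(shifted)

# Interior columns of conv(shifted) equal the shifted conv(x); only columns
# near the padded/wrapped borders differ.
equivariant = torch.allclose(
    torch.roll(y, 2, dims=-1)[..., 3:-3], y_shifted[..., 3:-3], atol=1e-6
)
```

No analogous guarantee holds for a Transformer block: shifting the input patches permutes the sequence, and the learned positional embeddings respond arbitrarily.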

What ViTs Must Learn from Data

A Vision Transformer has none of these built-in assumptions: every patch attends to every other patch from the first layer, and any notion of spatial proximity must be learned from data through the positional embeddings.

This flexibility is powerful with sufficient data (ViT was originally trained on JFT-300M, a dataset of 300 million images). But with only 5,000 images, the model cannot learn these spatial priors effectively.

The Bias-Variance Tradeoff

In classical statistical terms, the CNN's inductive biases trade variance for bias: its restricted hypothesis space generalizes well from few samples, while the ViT's unrestricted hypothesis space has high variance that only large datasets can tame.

This is why ViT-Large with JFT-300M pretraining surpasses the best CNNs, but our ViT-Tiny on 5,000 images loses by 21 points. The crossover point -- where ViTs begin to outperform CNNs -- occurs somewhere around ImageNet scale (~1M images) with appropriate regularization.

Bridging the Gap

Several subsequent works have addressed ViT's data efficiency -- most notably DeiT, which reaches competitive accuracy training on ImageNet-1k alone through strong augmentation and knowledge distillation from a CNN teacher.

Conclusion

Our from-scratch ViT implementation demonstrates both the elegance and the limitation of applying pure self-attention to images: the architecture is simple and general, but without the spatial priors a CNN gets for free, it needs far more data to compete.

The lesson is not that ViTs are inferior to CNNs -- they are not. The lesson is that inductive biases are priors, not limitations. When your prior matches the data distribution (as spatial locality matches natural images), it accelerates learning. When you have enough data to overcome the lack of priors, the more flexible architecture wins.

This is arguably the deepest insight of the ViT paper: the boundary between "architecture design" and "learning from data" is a spectrum, not a binary choice. Vision Transformers shifted that boundary decisively toward data-driven learning, and the field has never looked back.