In Parts 1 and 2, we derived and implemented a Vision Transformer from scratch: patch embeddings via Conv2d, multi-head self-attention without nn.MultiheadAttention, pre-norm Transformer blocks, and the full ViT assembly. Now we confront the critical question: how does it actually perform?
We train our ViT-Tiny on CIFAR-10 alongside two CNN baselines, all on a restricted 5,000-image subset for 30 epochs. The results reveal a fundamental tension between architectural flexibility and data efficiency -- and confirm a central finding of the original ViT paper.
## Experimental Setup

### Models
| Model | Parameters | Inductive Bias |
|---|---|---|
| ViT-Tiny | 1,205,898 | None (learned) |
| SimpleCNN | 141,354 | Locality, equivariance |
| SmallResNet | 175,258 | Locality, residual |
### Training

- Optimizer: AdamW (weight decay $= 0.01$)
- Learning rate: $10^{-3}$ with linear warmup (5 epochs for ViT, 3 for the CNNs) followed by cosine annealing
- Loss: cross-entropy with label smoothing ($\epsilon = 0.1$)
- Schedule: 30 epochs on 5,000 training images, evaluated on 1,000 test images
## Results
| Model | Params | Best Test Acc | Final Train Acc |
|---|---|---|---|
| ViT-Tiny | 1,205,898 | 49.60% | 53.52% |
| SimpleCNN | 141,354 | 70.80% | 80.24% |
| SmallResNet | 175,258 | 63.30% | 69.06% |
The CNN outperforms the ViT by 21.2 percentage points despite having $8.5\times$ fewer parameters.
### Training Dynamics
- CNN: Fast initial learning, reaching 50% train accuracy by epoch 5. Continues to climb steadily, with a growing train-test gap indicating mild overfitting on the small dataset.
- ViT: Slow start during the warmup phase (~16% at epoch 1, barely above random). Gradual improvement throughout training, but it never catches up to the CNN's accuracy. The train-test gap remains modest -- the model is underfitting, not overfitting.
This is a crucial distinction: the CNN's failure mode on small data is overfitting (80% train vs 70% test). The ViT's failure mode is underfitting (54% train vs 50% test) -- it cannot even fit the training data well, despite having $8.5\times$ more parameters.
## Attention Map Analysis
We extract the attention weights from the last Transformer layer and examine which patches the CLS token attends to. Each of the 4 heads produces a $65 \times 65$ attention matrix; we take the CLS token's row (index 0) and the patch columns (indices 1-64), reshaping into an $8 \times 8$ spatial map.
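A sketch of the extraction, assuming the forward pass exposes the last layer's post-softmax attention weights (a random tensor of the right shape stands in for them here):

```python
import torch

# Shapes from our setup: 4 heads, 65 tokens (1 CLS + 64 patches) for 32x32
# CIFAR images with 4x4 patches. `attn` stands in for the last layer's
# softmaxed attention weights, shape (batch, heads, tokens, tokens).
torch.manual_seed(0)
attn = torch.rand(1, 4, 65, 65).softmax(dim=-1)

cls_to_patches = attn[0, :, 0, 1:]      # CLS row, drop the CLS column -> (4, 64)
maps = cls_to_patches.reshape(4, 8, 8)  # one 8x8 spatial map per head
print(maps.shape)                       # torch.Size([4, 8, 8])
```

Each of the four maps can then be upsampled to $32 \times 32$ and overlaid on the input image for visualization.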
### Observations
Even with limited training, the attention heads show differentiated behavior:
- Some heads attend broadly across the image, aggregating global context
- Other heads show localized attention patterns, focusing on specific spatial regions
- The attention distributions are non-uniform -- the model has learned that some patches are more informative than others for the given input
This head specialization is consistent with findings from NLP Transformers and suggests that the multi-head mechanism is serving its intended purpose: different heads capture different types of relationships between patches.
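One simple way to quantify the broad-vs-local distinction (a diagnostic of our own, not from the ViT paper) is the entropy of each head's CLS-attention distribution: a head attending uniformly over all 64 patches scores $\ln 64 \approx 4.16$ nats, while a head focused on a single patch scores near 0.

```python
import torch

torch.manual_seed(0)
attn = torch.rand(1, 4, 65, 65).softmax(dim=-1)        # stand-in attention weights

cls_attn = attn[0, :, 0, 1:]                           # CLS -> patches, (4, 64)
cls_attn = cls_attn / cls_attn.sum(-1, keepdim=True)   # renormalize over patches only
entropy = -(cls_attn * cls_attn.log()).sum(-1)         # nats, one value per head

print(entropy)   # 4 values in (0, ln 64]; lower = more localized head
```

On the trained model, a spread of entropy values across heads would be a numerical signature of the specialization visible in the maps.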
## Positional Embedding Analysis
We compute the cosine similarity between all pairs of learned positional embeddings. If the model has learned meaningful spatial structure, nearby patches should have more similar positional embeddings than distant ones.
### Similarity to Center Patch
For a more intuitive visualization, we compute the cosine similarity of every patch position to a near-center patch (index 27, i.e. row 3, column 3 of the 0-indexed $8 \times 8$ grid). The resulting $8 \times 8$ heatmap shows:
- Highest similarity at and near the chosen position
- Decreasing similarity with increasing distance
- The model has discovered 2D spatial locality from 1D positional indices

This confirms Dosovitskiy et al.'s finding: even 1D positional embeddings can capture 2D spatial structure. The model learns that patch 27 is spatially close to patches 26 and 28 (horizontal neighbors) and to patches 19 and 35 (vertical neighbors, 8 positions apart in the flattened sequence).
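A sketch of the heatmap computation, with a random tensor standing in for the learned table (`pos_embed` has shape `(1, 65, dim)`; index 0 is the CLS slot, indices 1-64 the patch grid):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
pos_embed = torch.randn(1, 65, 192)   # stand-in for the learned embedding table

patches = pos_embed[0, 1:]            # (64, dim), drop the CLS slot
center = patches[3 * 8 + 3]           # row 3, col 3: a near-center patch
sim = F.cosine_similarity(patches, center.unsqueeze(0), dim=-1)
heatmap = sim.reshape(8, 8)           # 2D similarity map, ready to plot

print(heatmap[3, 3])                  # similarity of the patch to itself: 1.0
```

With random embeddings the off-center values are near zero; on the trained model they decay smoothly with spatial distance, which is exactly the locality structure described above.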
## Why ViTs Fail on Small Data
Our results cleanly reproduce the central finding of the ViT paper. The explanation involves the concept of inductive bias.
### What CNNs Know a Priori
Convolutional networks encode two strong assumptions about images:
- Locality: A $3 \times 3$ filter only sees nearby pixels. Important features (edges, textures) are local.
- Translation equivariance: The same filter is applied at every spatial location, so a cat produces the same feature response whether it appears in the top-left or the bottom-right -- only shifted.
These assumptions are correct for natural images and dramatically reduce the hypothesis space. A CNN doesn't need to learn that nearby pixels are related -- it knows this architecturally.
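The equivariance prior is easy to demonstrate in a toy example: shifting a conv layer's input shifts its output by the same amount. (We check only columns away from the zero-padding/wrap seam, where the guarantee is exact.)

```python
import torch
from torch import nn

torch.manual_seed(0)
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)

x = torch.randn(1, 1, 8, 8)
shifted = torch.roll(x, shifts=2, dims=-1)   # shift the input 2 pixels right

out = conv(x)
out_shifted = conv(shifted)

# Away from the padding/wrap border columns, the outputs match after the
# same 2-pixel shift -- no learning required.
print(torch.allclose(torch.roll(out, 2, dims=-1)[..., 3:7],
                     out_shifted[..., 3:7]))   # True
```

A ViT's patch embedding plus learned positional table gives no such guarantee for free; any shift-invariance must be induced from data.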
### What ViTs Must Learn from Data
A Vision Transformer has none of these built-in assumptions:
- Every patch can attend to every other patch from layer 1 -- there is no locality constraint
- Positional embeddings are learned from scratch -- spatial structure is not given
- The attention pattern is input-dependent -- the model must learn when to attend locally vs globally
This flexibility is powerful with sufficient data (ViT was originally trained on JFT-300M, a dataset of 300 million images). But with only 5,000 images, the model cannot learn these spatial priors effectively.
### The Bias-Variance Tradeoff
In classical statistical terms:
- CNNs: High bias (strong assumptions), low variance $\rightarrow$ good on small data
- ViTs: Low bias (few assumptions), high variance $\rightarrow$ need large data to generalize
This is why ViT-Large with JFT-300M pretraining surpasses the best CNNs, but our ViT-Tiny on 5,000 images loses by 21 points. The crossover point -- where ViTs begin to outperform CNNs -- occurs somewhere around ImageNet scale (~1M images) with appropriate regularization.
## Bridging the Gap
Several subsequent works have addressed ViT's data efficiency:
- DeiT (Touvron et al., 2021): Training recipe with strong data augmentation, knowledge distillation, and regularization enables ViT to match CNNs on ImageNet-1K alone
- Hybrid models: CCT, CvT, and CoAtNet introduce convolutional stems or local attention to inject inductive bias
- Masked pretraining: MAE (He et al., 2022) uses masked image modeling as a self-supervised pretext task, enabling data-efficient ViT training
## Conclusion
Our from-scratch ViT implementation demonstrates both the elegance and the limitation of applying pure self-attention to images:
- The architecture is remarkably simple: patch embed, stack Transformer blocks, classify the CLS token
- Attention maps show meaningful spatial specialization even with limited training
- Positional embeddings learn 2D structure from 1D indices
- But data efficiency remains the fundamental bottleneck in the low-data regime
The lesson is not that ViTs are inferior to CNNs -- they are not. The lesson is that inductive biases are priors, not limitations. When your prior matches the data distribution (as spatial locality matches natural images), it accelerates learning. When you have enough data to overcome the lack of priors, the more flexible architecture wins.
This is arguably the deepest insight of the ViT paper: the boundary between "architecture design" and "learning from data" is a spectrum, not a binary choice. Vision Transformers shifted that boundary decisively toward data-driven learning, and the field has never looked back.