Deconstructing CLIP

Part 3: 100% Zero-Shot Classification in 14 Seconds

Setup Recap

Task: 16-way zero-shot classification of $32 \times 32$ RGB images. Random baseline: $6.25\%$. Data: $3{,}200$ synthetic image-caption pairs (4 colors $\times$ 4 shapes $\times$ 200 examples), split $85/15$ into $2{,}720$ train and $480$ held out. Model: $74{,}913$ trainable parameters total — a small CNN image encoder, a token-pool text encoder, and a learnable temperature scalar. Training: AdamW with $\eta = 5 \times 10^{-4}$, weight decay $0.05$, cosine schedule over $30$ epochs, batch size $64$. Hardware: Apple M-series MPS. Total wall-clock: $13.9$ seconds.
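
To make the schedule concrete, here is a minimal sketch of a cosine learning-rate decay using the stated $\eta = 5 \times 10^{-4}$ and $30$ epochs. It is written in plain Python so it runs without dependencies. The function name `cosine_lr`, the per-epoch (rather than per-step) update, the absence of warmup, and the final learning rate of zero are all assumptions; the setup above does not specify them.

```python
import math

def cosine_lr(epoch, total_epochs=30, base_lr=5e-4, min_lr=0.0):
    """Cosine-annealed learning rate at the start of `epoch` (0-indexed).

    Assumed form: min_lr + 0.5 * (base_lr - min_lr) * (1 + cos(pi * t / T)).
    """
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# One learning rate per epoch; index 0 corresponds to epoch 1 in the text.
schedule = [cosine_lr(e) for e in range(30)]

for label, idx in (("epoch 1", 0), ("epoch 16", 15), ("epoch 30", 29)):
    print(f"{label}: lr = {schedule[idx]:.3e}")
```

Under these assumptions the rate starts at the full $5 \times 10^{-4}$, passes through half that value at the midpoint, and is close to zero in the final epochs.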

The Training Trajectory

Per-epoch InfoNCE loss and held-out zero-shot accuracy:

| Epoch | InfoNCE loss | Zero-shot accuracy (16-way) |
|---|---|---|
| $1$ | $3.5117$ | $27.08\%$ |
| $4$ | $2.6948$ | $41.67\%$ |
| $8$ | $2.2163$ | $67.71\%$ |
| $12$ | $1.8234$ | $91.67\%$ |
| $17$ | $1.5611$ | $100.00\%$ |
| $30$ | $1.5014$ | $100.00\%$ (held) |

Notice that epoch $1$ already reaches $27.1\%$ zero-shot accuracy, over four times the $6.25\%$ random baseline. That is unusually high for an image classifier after a single epoch. The reason is the density of the contrastive signal: every training batch is itself a $64$-way classification problem (which of these $64$ captions matches which of these $64$ images?), so each gradient step carries a lot of supervision.
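
The "every batch is a $64$-way classification problem" view can be sketched directly. The code below is a plain-Python illustration of a symmetric InfoNCE loss over a $B \times B$ similarity matrix, not the training code from this series; the symmetric (image-to-caption plus caption-to-image) averaging and the helper names `log_softmax_row` and `info_nce` are assumptions.

```python
import math

def log_softmax_row(row):
    """Numerically stable log-softmax of one list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]

def info_nce(sim, temperature=1.0):
    """Symmetric InfoNCE over a B x B similarity matrix.

    sim[i][j] is the similarity of image i and caption j. The matching
    caption for image i is assumed to sit on the diagonal (j == i).
    """
    b = len(sim)
    logits = [[v / temperature for v in row] for row in sim]
    # Image -> caption: each row is a B-way classification with target i.
    i2t = -sum(log_softmax_row(logits[i])[i] for i in range(b)) / b
    # Caption -> image: the same computation on the transposed matrix.
    cols = [list(col) for col in zip(*logits)]
    t2i = -sum(log_softmax_row(cols[j])[j] for j in range(b)) / b
    return 0.5 * (i2t + t2i)

B = 64
# A model that scores every pair identically is guessing: loss = ln(64).
chance_loss = info_nce([[0.0] * B for _ in range(B)])
# A model that scores only matching pairs highly drives the loss toward 0.
identity = [[1.0 if i == j else 0.0 for j in range(B)] for i in range(B)]
sharp_loss = info_nce(identity, temperature=0.01)
print(f"chance: {chance_loss:.4f}, sharp: {sharp_loss:.4f}")
```

For reference, $\ln 64 \approx 4.16$ is what an uninformed model would score on a $64$-item batch, so the epoch $1$ loss of $3.5117$ is already below chance level.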

By epoch $12$ the model is at $91.7\%$ accuracy; by epoch $17$ it is at $100\%$ on every held-out image. The remaining $13$ epochs do nothing measurable to accuracy but continue to slightly tighten the embedding geometry — the loss continues descending from $1.56$ to $1.50$. This is consistent with the model continuing to refine cluster boundaries even after classification accuracy saturates.
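
One possible reading of why the loss settles near $1.50$ rather than approaching zero, offered as a hedged back-of-the-envelope check and not something this run establishes: with only $16$ distinct template captions and a batch of $64$, most captions appear several times per batch. Identical captions get identical text embeddings, so the model cannot tell apart the $k$ batch items that share a caption, and the best it can do for such an item is a loss of $\ln k$. The sketch below computes the expected value of $\ln k$, assuming classes are sampled uniformly and independently into each batch and that the reported loss is a per-example average. The function name `expected_loss_floor` is illustrative.

```python
import math

def expected_loss_floor(batch_size=64, num_classes=16):
    """E[ln k], where k counts the batch items sharing one example's caption.

    Assumes uniform, independent class sampling, so the number of *other*
    items with the same caption is Binomial(batch_size - 1, 1 / num_classes).
    """
    n, p = batch_size - 1, 1.0 / num_classes
    total = 0.0
    for x in range(n + 1):
        prob = math.comb(n, x) * p**x * (1.0 - p) ** (n - x)
        total += prob * math.log(1 + x)  # k = 1 + x (the example itself plus x duplicates)
    return total

floor = expected_loss_floor()
print(f"approximate loss floor for B=64, 16 captions: {floor:.3f}")
```

If these assumptions hold, the floor lands close to the final reported loss, which would mean the late-epoch loss decrease is the model approaching the best value the batch construction allows.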

What the Embedding Space Looks Like

A t-SNE projection of the held-out image embeddings (small dots) together with the $16$ class-caption embeddings (large stars), colored by class, reveals the structure that the contrastive loss carved out. There are exactly $16$ tight clusters. Each cluster contains one star (the class caption) at its centre, surrounded by approximately $30$ dots (the test images of that class). There are no class-mixing regions and no outliers.

This is a strong qualitative result. The model has discovered $16$ distinct classes purely from the structure of (image, caption) pairs — and crucially, it discovered the factorisation of those classes into (color $\times$ shape). The embedding space arranges classes such that "red circle" and "red square" are closer to each other than to "blue triangle" — even though the model was never told that color is a separable attribute.

Where did this structure come from? The model encoded "red" in some direction of the embedding space (because all four "red X" captions share that token and need to be close to the corresponding images). It encoded "circle" in another direction (similarly). The resulting embedding space is roughly a $16$-class structure organised by a $4 \times 4$ (color, shape) grid. None of this factorisation was supervised — it emerged from the contrastive objective acting on the (image, caption) pairing alone.
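
The "one direction per color, one direction per shape" picture can be illustrated with a hand-built toy. The embeddings below are constructed by hand and are not the learned ones. Each class is the normalised sum of a one-hot color vector and a one-hot shape vector. The names `color_4` and `shape_4` are placeholders, since the text names only three of the four colors and shapes.

```python
import math
from itertools import product

# Placeholders stand in for the fourth color and shape, which the text does not name.
COLORS = ["red", "blue", "green", "color_4"]
SHAPES = ["circle", "square", "triangle", "shape_4"]

def embed(color, shape):
    """Unit vector built as (one-hot color + one-hot shape) / sqrt(2)."""
    vec = [0.0] * 8
    vec[COLORS.index(color)] = 1.0
    vec[4 + SHAPES.index(shape)] = 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def cosine(a, b):
    """Dot product; equals cosine similarity because inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))

anchor = embed("red", "circle")
same_color = cosine(anchor, embed("red", "square"))
same_shape = cosine(anchor, embed("blue", "circle"))
neither = cosine(anchor, embed("blue", "triangle"))

# Count how many of the 16 classes share 2, 1, or 0 attributes with the anchor.
tiers = {}
for color, shape in product(COLORS, SHAPES):
    shared = (color == "red") + (shape == "circle")
    tiers[shared] = tiers.get(shared, 0) + 1

print(same_color, same_shape, neither, tiers)
```

In this idealised construction, sharing one attribute gives a similarity of about $0.5$ and sharing none gives $0$. That is the additive pattern described above, with $1$ exact match, $6$ classes sharing one attribute, and $9$ sharing none.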

This kind of compositional structure is the hallmark of representation learning. It is also what makes downstream "zero-shot" tasks work: a caption for a color-shape combination that never appeared in training should land in a predictable direction of the learned space. This run does not test that directly, since all $16$ combinations (including "a red triangle") appear in the training data.

The Similarity Matrix on a Held-Out Batch

The image-caption similarity matrix on a held-out batch of $16$ test images (one per class) compared against the $16$ class-caption embeddings reveals the expected pattern: each row's maximum is on the diagonal (the matching class) with a clear margin to the runners-up.

The off-diagonal structure is informative. For a "red circle" test image, the next-highest similarities are typically with "red square" or "red triangle" (same color, different shape) or with "blue circle" or "green circle" (same shape, different color). The lowest similarities go to the "opposite" classes, which differ in both color and shape. This tells us the model encodes color and shape as somewhat independent dimensions of the embedding space, with both contributing roughly additively to similarity.

This qualitative result is what you cannot get from a benchmark accuracy number alone. The model is not just $100\%$ correct — it is correct for the right reasons, with internal representations that align with how a human would describe the classes.

Zero-Shot Is Not Shotless

The "zero-shot" framing means we did not train a classifier head. The classification at inference time uses only the trained encoders plus a similarity computation. No softmax-over-class-indices layer, no fitted MLP head, no class-specific weights.
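
A minimal sketch of that inference step follows, using plain Python lists in place of encoder outputs. The function names and the three-dimensional embedding values are hypothetical, chosen only so the example runs on its own; in practice the vectors would come from the trained image and text encoders.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def zero_shot_classify(image_emb, caption_embs):
    """Rank captions by cosine similarity to one image embedding.

    Returns (index of the best caption, list of all similarities).
    There is no classifier head: the "weights" are the caption embeddings.
    """
    img = l2_normalize(image_emb)
    sims = [sum(a * b for a, b in zip(img, l2_normalize(cap))) for cap in caption_embs]
    best = max(range(len(sims)), key=sims.__getitem__)
    return best, sims

# Hypothetical embeddings standing in for text-encoder outputs.
captions = {
    "a red circle": [1.0, 0.1, 0.0],
    "a red square": [0.7, 0.7, 0.0],
    "a blue triangle": [0.0, 0.2, 1.0],
}
names = list(captions)
image = [0.9, 0.2, 0.1]  # hypothetical image-encoder output

best, sims = zero_shot_classify(image, [captions[n] for n in names])
print(names[best], [round(s, 3) for s in sims])
```

Changing the set of classes only means changing the list of captions, which is why no retraining or fitted head is needed at inference time.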

But we did train the model on pairs of (image, caption), and the captions encode the class structure. The model learned to put image $i$ near caption $i$ in embedding space; classification falls out of that as a free byproduct. If we had trained on pairs where the captions didn't reveal class structure (say, all captions were identical, or all were random gibberish), we would have learned a useless embedding space.

This is the key reframe of CLIP. Zero-shot classification is not the model recognising classes it has never seen. It is the model exploiting the natural-language structure that was already in the captions during training. When the original OpenAI CLIP paper reported $76\%$ zero-shot accuracy on ImageNet, it wasn't magic emergence: ImageNet class names ("golden retriever", "syringe") appeared often enough in the captions of its 400M-pair WebImageText training set that the model learned useful representations for them.

The corollary: CLIP is bad at zero-shot tasks where the class names were not present in training captions. Medical imaging (very domain-specific terms underrepresented in web captions), legal documents, specialised scientific notation — these are CLIP's blind spots, not because the algorithm is weak but because the training corpus didn't contain the natural-language framing those domains need.

Why the Batch Size Trick Matters

In each batch of $B$ pairs, each image is contrasted against $B - 1$ negative captions. A model trained at $B = 64$ sees $63$ negative comparisons per example per step. OpenAI's CLIP trained at $B = 32{,}768$. That is roughly $500\times$ more negative comparisons per gradient step.
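
The arithmetic behind the "roughly $500\times$" figure, as a small check. The helper name `negatives_per_example` is illustrative.

```python
def negatives_per_example(batch_size):
    """Each example is contrasted against every other item in its batch."""
    return batch_size - 1

toy_b, clip_b = 64, 32768
toy_neg = negatives_per_example(toy_b)
clip_neg = negatives_per_example(clip_b)
ratio = clip_neg / toy_neg

print(f"toy:  {toy_neg} negatives per example")
print(f"CLIP: {clip_neg} negatives per example")
print(f"ratio of negatives: {ratio:.1f}x (batch-size ratio: {clip_b // toy_b}x)")
```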

This is empirically the most important hyperparameter for CLIP-style training. Larger batches give:

More negatives per positive. The discriminative task gets harder: the model has to find the matching caption among many candidates rather than among few. Harder discrimination forces sharper embeddings.

More gradient signal per step. With more pairwise comparisons informing each update, the optimizer extracts more information from each batch.

More chance of seeing semantically-similar hard negatives in the same batch. A batch of $32{,}768$ likely contains multiple semantically related images (different dogs, different cars, different colors of the same shape). Distinguishing between these "hard negatives" is what teaches the embedding space its fine-grained structure.

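This is also why CLIP-style training scales so well with compute. Doubling the batch size doubles the encoder work per gradient step and quadruples the number of image-caption comparisons, since the similarity matrix has $B^2$ entries. The encoder passes are independent per example, so they parallelise naturally, and CLIP-style training can scale close to linearly under data-parallel training across many GPUs, provided the embeddings are gathered across devices to form the full similarity matrix.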
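
A rough cost sketch helps separate what grows linearly with batch size from what grows quadratically. The function name `similarity_cost` is illustrative, and the 4 bytes per value assumes the similarity matrix is held in 32-bit floats on a single device. Large-scale implementations typically shard this work, so treat the memory figure as an upper-bound illustration.

```python
def similarity_cost(batch_size, bytes_per_value=4):
    """Return (encoder_inputs, similarity_entries, similarity_bytes)."""
    encoder_inputs = 2 * batch_size        # B images plus B captions
    entries = batch_size * batch_size      # full B x B similarity matrix
    return encoder_inputs, entries, entries * bytes_per_value

for b in (64, 128, 32768):
    inputs, entries, nbytes = similarity_cost(b)
    print(f"B={b:>6}: {inputs:>6} encoder inputs, "
          f"{entries:>13,} similarities, {nbytes / 2**20:,.2f} MiB")
```

Encoder inputs grow linearly with $B$, while the number of pairwise similarities grows with $B^2$, reaching $2^{30}$ entries at $B = 32{,}768$.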

Generalisation: What Scales

To scale our experiment to OpenAI's CLIP, several knobs would change:

Bigger image encoder. ViT-B/32 ($\sim 86$M parameters) or ResNet-50 ($\sim 25$M) instead of our small CNN ($\sim 40$K). The algorithm is identical; the encoder just has more capacity to represent the visual diversity of $400$M images.

Bigger text encoder. A $12$-layer Transformer ($\sim 60$M parameters) instead of our token-pool ($\sim 10$K). At natural-language scale with variable-length captions, full Transformer attention captures dependencies that mean-pool cannot.

Bigger captions. Variable-length natural language, vocabulary size $\sim 50{,}000$ (BPE), captions ranging from one word to several sentences. Our toy uses a $10$-token vocabulary and $4$-token captions.

Bigger batches. $32{,}768$ instead of $64$. This is $500\times$ more negatives per positive, requiring a multi-GPU all-gather to assemble the batch's embeddings.

Bigger corpus. WebImageText (400M pairs) instead of our $3{,}200$. This is $125{,}000\times$ more examples, and the diversity makes the difference between memorising classes and learning a general visual-language embedding space.

That is OpenAI's CLIP. The algorithm doesn't change. The exact same loss, the exact same architecture skeleton, the exact same evaluation recipe. The differences are entirely in the magnitudes.

Why Toy CLIP Works So Well Here But Not at ImageNet

Our toy CLIP reaches $100\%$ on $16$-way classification. The same architecture would score close to the $0.1\%$ random baseline on ImageNet's $1{,}000$ classes. Why the gap?

Scale. ImageNet classes span millions of training images and a vocabulary of tens of thousands of words. Our toy handles $4 \times 4 = 16$ classes; that is what its capacity is for.

Caption diversity. Real CLIP's training corpus has captions that vary in length, vocabulary, and style. Our captions are template strings ("a red circle"). The model never had to learn robustness to caption variations because there weren't any.

The point of our experiment is not to compete with CLIP at scale — it is to demonstrate that the entire CLIP recipe (contrastive InfoNCE on (image, caption) pairs, L2-normalised embeddings, learnable temperature, zero-shot classification via caption ranking) is genuinely just an $80$-line PyTorch program plus enough data.

What This Demonstrates

Full code on GitHub: github.com/soveshmohapatra/CLIP