Deconstructing CLIP: Part 1 — Contrastive Learning of Image-Text Pairs

Overview

CLIP did not invent contrastive learning. It scaled it. Two encoders, a shared embedding space, and a single InfoNCE loss: trained on 400M (image, caption) pairs, that recipe produced dramatic results.

The Setup

Let $f_\text{img}, f_\text{txt}$ project images and text into a shared $d$-dimensional space. For an image $x$ and caption $y$:

u = \frac{f_\text{img}(x)}{\|f_\text{img}(x)\|}, \qquad v = \frac{f_\text{txt}(y)}{\|f_\text{txt}(y)\|}.

L2 normalisation makes $u^\top v$ exactly the cosine similarity and removes the magnitude degree of freedom — the model is forced to learn directions, not scales.
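A minimal NumPy sketch of the normalisation step, assuming random arrays as stand-ins for the encoder outputs $f_\text{img}(x)$ and $f_\text{txt}(y)$ (the helper name `l2_normalize` and the toy shapes are illustrative, not taken from any CLIP codebase):

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale each vector along `axis` to unit Euclidean length."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)  # eps guards against division by zero

rng = np.random.default_rng(0)
img_feat = rng.normal(size=(4, 8))  # stand-in for f_img(x): 4 images, d = 8
txt_feat = rng.normal(size=(4, 8))  # stand-in for f_txt(y): 4 captions

u = l2_normalize(img_feat)
v = l2_normalize(txt_feat)

# After normalisation, a plain dot product is the cosine similarity.
cos_from_dot = u @ v.T
cos_explicit = (img_feat @ txt_feat.T) / np.outer(
    np.linalg.norm(img_feat, axis=1), np.linalg.norm(txt_feat, axis=1)
)

# Rescaling the raw features changes nothing: only direction survives.
u_scaled = l2_normalize(10.0 * img_feat)
```

Here `cos_from_dot` and `cos_explicit` agree up to floating-point error, and `u_scaled` equals `u`, which is the "directions, not scales" point in code form.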

The InfoNCE Loss

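For a batch of $B$ pairs, stack the normalised embeddings as the rows of $U, V \in \mathbb{R}^{B \times d}$ and compute the $B \times B$ similarity matrix $S = \frac{1}{\tau} U V^\top$, so that $S_{ij} = u_i^\top v_j / \tau$. The loss is two symmetric cross-entropies: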

\mathcal{L}_\text{img} = -\frac{1}{B} \sum_i \log \frac{\exp(S_{ii})}{\sum_j \exp(S_{ij})}, \quad \mathcal{L}_\text{txt} = -\frac{1}{B} \sum_i \log \frac{\exp(S_{ii})}{\sum_j \exp(S_{ji})}

\mathcal{L} = (\mathcal{L}_\text{img} + \mathcal{L}_\text{txt}) / 2.

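Each row of $S$ is a $B$-way classification: "which caption matches this image?" The target is the diagonal entry. Each column is the mirror-image problem: "which image matches this caption?", again with the diagonal as the target.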
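The two cross-entropies can be sketched in a few lines of NumPy. This is an illustrative implementation of the formulas above rather than OpenAI's training code; the names `log_softmax` and `clip_loss` and the toy inputs are assumptions made for the example.

```python
import numpy as np

def log_softmax(z, axis):
    """Numerically stable log-softmax along `axis`."""
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def clip_loss(u, v, tau=0.07):
    """Symmetric InfoNCE for unit-norm rows u (images) and v (captions), shape (B, d)."""
    s = (u @ v.T) / tau                                   # S = U V^T / tau
    loss_img = -np.mean(np.diag(log_softmax(s, axis=1)))  # rows: image -> caption
    loss_txt = -np.mean(np.diag(log_softmax(s, axis=0)))  # columns: caption -> image
    return 0.5 * (loss_img + loss_txt)

# Toy batch of B = 4 with orthonormal embeddings.
aligned = np.eye(4)
loss_aligned = clip_loss(aligned, aligned)                       # every pair matches
loss_mismatched = clip_loss(aligned, np.roll(aligned, 1, axis=0))  # captions shifted by one

# If every similarity is identical, each softmax is uniform and the loss is log(B).
same = np.ones((4, 2)) / np.sqrt(2.0)
loss_uniform = clip_loss(same, same)
```

With perfectly aligned pairs the loss is close to zero, with shifted captions it is large, and with indistinguishable embeddings it equals $\log B$, the loss of guessing uniformly among $B$ candidates.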

Why Temperature Is Learnable

$\tau$ controls softmax peakedness. CLIP parameterises $1/\tau$ as $\exp(\text{logit\_scale})$ for stability, initialises at $\tau = 0.07$, and clamps $1/\tau \leq 100$. The model learns its own sharpness; in practice $1/\tau$ approaches the clamp during late training.
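As a small sketch of that parameterisation, using only the standard library: the learnable scalar lives in log space, so $1/\tau = \exp(\text{logit\_scale})$ is always positive, and the clamp caps it at 100. The function and constant names here are illustrative, and where exactly the clamp is applied in a real training loop is not specified above.

```python
import math

INIT_TAU = 0.07    # initial temperature, as stated above
MAX_SCALE = 100.0  # upper bound on 1/tau, as stated above

# The learnable parameter is the log of the inverse temperature.
logit_scale = math.log(1.0 / INIT_TAU)

def inverse_temperature(logit_scale):
    """Return 1/tau = exp(logit_scale), clamped to at most MAX_SCALE."""
    return min(math.exp(logit_scale), MAX_SCALE)

scale_at_init = inverse_temperature(logit_scale)  # 1 / 0.07, about 14.29
scale_when_large = inverse_temperature(10.0)      # exp(10) exceeds the cap, so 100.0
```

The similarity matrix is then `inverse_temperature(logit_scale) * (U @ V.T)`, and the gradient with respect to `logit_scale` is what lets the model tune its own sharpness.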

Why Batch Size Is The Most Important Hyperparameter

In a batch of $B$ pairs, each image is contrasted against $B-1$ negative captions, and each caption against $B-1$ negative images. Larger $B$ = more negatives per positive = sharper embedding space. OpenAI's CLIP trained at $B = 32{,}768$. That isn't a typo. Doubling the batch size doubles the negatives per positive and roughly quadruples the pairwise comparisons per gradient step, since $S$ has $B^2$ entries. That is a large part of why CLIP scales so well with compute.
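The counting is easy to make concrete. The helper below (a hypothetical name, written for this post) tallies the negatives each image sees and the number of entries in $S$ for a given batch size:

```python
def contrast_counts(batch_size):
    """Count negatives per positive and entries of the B x B similarity matrix."""
    negatives_per_positive = batch_size - 1
    matrix_entries = batch_size * batch_size
    negative_entries = matrix_entries - batch_size  # everything off the diagonal
    return negatives_per_positive, matrix_entries, negative_entries

clip_batch = contrast_counts(32_768)
# (32767, 1073741824, 1073709056): over a billion similarities per step

doubled = contrast_counts(2 * 32_768)
# Negatives per positive roughly double; matrix entries grow by exactly 4x.
```

Negatives per positive grow linearly in $B$, while the similarity matrix grows quadratically.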

Zero-Shot Classification, Demystified

At inference: wrap each candidate class name in a prompt ("a photo of a dog"), encode the prompts via the text encoder, compute the cosine similarity of each with the image embedding, and assign the image to the most similar class. "Zero-shot" means no labelled examples of that class, but CLIP did see captions during training, and class names appear in captions. The model learned that "a dog" should be near images of dogs simply because it saw many such pairs.
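The inference rule is an argmax over cosine similarities. The sketch below uses hand-written 3-dimensional vectors as stand-ins for real encoder outputs, so the embeddings, the prompt list, and the name `zero_shot_predict` are all assumptions made for illustration:

```python
import numpy as np

def zero_shot_predict(image_emb, class_text_embs):
    """Return (index of the most similar class prompt, all cosine similarities)."""
    u = image_emb / np.linalg.norm(image_emb)
    v = class_text_embs / np.linalg.norm(class_text_embs, axis=1, keepdims=True)
    sims = v @ u  # one cosine similarity per class prompt
    return int(np.argmax(sims)), sims

class_prompts = ["a photo of a dog", "a photo of a cat", "a photo of a car"]

# Toy stand-ins for f_txt(prompt); a real text encoder would produce these.
text_embs = np.array([
    [1.0, 0.1, 0.0],
    [0.1, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
# Toy stand-in for f_img(x) of a dog photo.
image_emb = np.array([0.9, 0.2, 0.1])

pred, sims = zero_shot_predict(image_emb, text_embs)
predicted_prompt = class_prompts[pred]
```

No classifier head is trained here: swapping in a different list of prompts changes the label set without touching either encoder.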

The reframe: CLIP's zero-shot is not magic emergence. It is exploitation of natural-language structure already inside the captions. When CLIP got 76% on ImageNet zero-shot, it was because ImageNet class names appeared often enough in the captions of WebImageText (WIT), the 400M-pair web dataset OpenAI collected for training.

Summary

Two encoders, one shared embedding space, one symmetric contrastive loss.
L2-normalisation makes the dot product cosine similarity.
Learnable temperature lets the model adapt softmax sharpness.
Batch size is the most important hyperparameter — the number of negatives per positive.
"Zero-shot" is the natural-language structure of captions surfacing at inference time.