Deconstructing CLIP: Part 2 - Eighty Lines

Overview

Part 1 made the structural argument: CLIP is two encoders, one shared embedding space, and one symmetric InfoNCE loss. The supervision comes from pairing (an image and its caption) rather than from labels. Now we implement it from scratch in pure PyTorch. The full CLIP machinery — image encoder, text encoder, normalisation, learnable temperature, symmetric loss — is about $80$ lines of code.

The interesting parts are not the encoder architectures (a small CNN, a small token-pool) but the small choices that turn out to be load-bearing: the L2 normalisation that makes dot product into cosine similarity, the learnable temperature parameterised in log-space, and the bidirectional formulation of InfoNCE.

File Map

Three files: dataset.py generates the synthetic colored-shapes corpus with templated captions; clip.py defines the two encoders and the CLIP wrapper class plus the InfoNCE loss; train.py runs the training loop and zero-shot evaluation. The substantive CLIP code lives in clip.py at roughly $80$ lines; the dataset and training scripts add another $\sim 200$ lines of glue.

Synthetic Dataset

Four colors (red, blue, green, yellow) crossed with four shapes (circle, square, triangle, cross) gives $16$ classes. Each example is a $32 \times 32$ image of one shape drawn in one color on a near-black background, paired with the caption "a {color} {shape}".

The vocabulary contains only $10$ tokens: a padding token (id $0$) plus {a, red, blue, green, yellow, circle, square, triangle, cross}. Captions are encoded as fixed-length $4$-token sequences, so each three-word caption is followed by one padding token. This is deliberately small: it lets us inspect every learned token embedding directly, and the fixed length keeps batching trivial, with a single padding mask as the only bookkeeping.
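A minimal sketch of this caption encoding. The pad token spelling `<pad>`, the id ordering after pad = 0, and the helper name `encode_caption` are assumptions made for illustration; they are not taken from dataset.py.

```python
# Hypothetical vocabulary layout. Only "pad has id 0" is fixed by the article
# (the text encoder uses padding_idx=0); the remaining ids are an assumed order.
VOCAB = ["<pad>", "a", "red", "blue", "green", "yellow",
         "circle", "square", "triangle", "cross"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}
MAX_LEN = 4  # fixed caption length from the text


def encode_caption(caption, max_len=MAX_LEN):
    """Map a caption such as 'a red circle' to a fixed-length list of token ids."""
    ids = [TOKEN_TO_ID[word] for word in caption.split()]
    if len(ids) > max_len:
        raise ValueError(f"caption longer than {max_len} tokens: {caption!r}")
    # Right-pad with the pad id (0) up to max_len.
    return ids + [TOKEN_TO_ID["<pad>"]] * (max_len - len(ids))


print(encode_caption("a red circle"))   # [1, 2, 6, 0]
```

Every caption in this corpus has three words, so every encoded row ends in exactly one pad id.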

$200$ examples per class times $16$ classes is $3{,}200$ total pairs, with $85\%$ used for training and $15\%$ held out for zero-shot evaluation. This is roughly $5$ orders of magnitude smaller than OpenAI's WIT-400M corpus. What survives at this scale is the algorithm itself; what doesn't is the diversity of real-world data.
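The split sizes and the size gap follow from simple arithmetic. A quick check, using only the numbers quoted in the paragraph above:

```python
import math

examples_per_class = 200
num_classes = 4 * 4                       # 4 colors x 4 shapes
total = examples_per_class * num_classes
num_train = total * 85 // 100             # 85% for training
num_test = total - num_train              # remaining 15% held out
print(total, num_train, num_test)         # 3200 2720 480

# Size gap to a 400-million-pair corpus, in orders of magnitude.
gap = math.log10(400_000_000 / total)
print(round(gap, 1))                      # 5.1
```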

ImageEncoder (Small CNN)

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageEncoder(nn.Module):
    def __init__(self, embed_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.GELU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.GELU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.GELU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 64, 3, padding=1), nn.GELU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, embed_dim)

    def forward(self, x):
        h = self.conv(x).flatten(1)
        return self.proj(h)

A standard small CNN: four conv layers with GELU activations, two max-pool downsamples ($32 \to 16 \to 8$), then adaptive average pooling to $1 \times 1$, then a linear projection to the shared $64$-dimensional embedding space.

The architecture is incidental. Real CLIP uses ViT-B/32 or ResNet-50 with around $100$M parameters. Our small CNN has about $70{,}000$. The architecture choice is independent of the contrastive objective: the same loss and training procedure work with any encoder that produces a fixed-dimensional output vector per image.
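The parameter count of the ImageEncoder listed above can be worked out by hand from the layer shapes. A small sketch, where `conv2d_params` and `linear_params` are helper names introduced here for illustration; GELU and the pooling layers carry no parameters.

```python
def conv2d_params(c_in, c_out, k):
    """Weights (c_in * c_out * k * k) plus one bias per output channel."""
    return c_in * c_out * k * k + c_out


def linear_params(n_in, n_out):
    """Weight matrix plus bias vector."""
    return n_in * n_out + n_out


# (in_channels, out_channels, kernel_size) for the four conv layers.
conv_layers = [(3, 32, 3), (32, 32, 3), (32, 64, 3), (64, 64, 3)]
conv_total = sum(conv2d_params(*layer) for layer in conv_layers)
proj_total = linear_params(64, 64)   # self.proj with embed_dim=64

print(conv_total)                # 65568
print(proj_total)                # 4160
print(conv_total + proj_total)   # 69728
```

More than half of the total sits in the last $64 \to 64$ convolution.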

AdaptiveAvgPool2d as the global aggregator. After the convolutions, the feature map is $64 \times 8 \times 8$. Adaptive average pooling reduces this to $64 \times 1 \times 1$ — a single $64$-dimensional summary of the image. This is the standard "global pooling" approach to converting a spatial feature map into a vector. ViT-based encoders use the [CLS] token instead; both serve the same role.

TextEncoder (Token Embedding + Mean Pool)

class TextEncoder(nn.Module):
    def __init__(self, vocab_size, max_len, embed_dim=64, hidden=64):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden, padding_idx=0)
        self.pos = nn.Embedding(max_len, hidden)
        self.norm = nn.LayerNorm(hidden)
        self.proj = nn.Linear(hidden, embed_dim)

    def forward(self, captions):
        B, T = captions.shape
        pos = torch.arange(T, device=captions.device).unsqueeze(0)
        h = self.tok(captions) + self.pos(pos)
        mask = (captions != 0).float().unsqueeze(-1)
        h = (h * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        return self.proj(self.norm(h))

For a $10$-word vocabulary a Transformer is theatre. Instead: token embeddings, positional embeddings, sum, mask out padding, mean-pool over non-padding positions, LayerNorm, project.

The padding_idx=0 argument. This tells PyTorch that token id $0$ is padding: its embedding row is initialised to zero and never updated by gradients. Without it, the pad token would get an ordinary, randomly initialised embedding row. The mask in forward already keeps pad positions out of the mean, so padding_idx is a second safeguard for the same rule: pad tokens should contribute nothing to the caption representation.

The mean-pool with explicit masking. (h * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0) computes the mean only over non-padding positions. mask is $1$ for real tokens and $0$ for padding. The clamp(min=1.0) prevents division by zero in the (impossible-here but defensive) case of an all-padding caption.
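A small check of the two mechanisms just described, padding_idx and the masked mean. The token ids and the embedding width of 8 are arbitrary choices for illustration, and positional embeddings are left out to keep the example short.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two captions of length 4; id 0 is padding. The non-zero ids are arbitrary.
captions = torch.tensor([[1, 2, 6, 0],
                         [1, 5, 0, 0]])

tok = nn.Embedding(10, 8, padding_idx=0)
h = tok(captions)                                   # shape (2, 4, 8)

# The pad row of the embedding table is all zeros.
print(tok.weight[0].abs().sum().item())             # 0.0

# Masked mean: sum over real tokens, divide by the number of real tokens.
mask = (captions != 0).float().unsqueeze(-1)        # shape (2, 4, 1)
pooled = (h * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)

# Same result as averaging only the real-token embeddings by hand.
manual = torch.stack([h[0, :3].mean(dim=0), h[1, :2].mean(dim=0)])
print(torch.allclose(pooled, manual))               # True

# The pad row receives no gradient.
pooled.sum().backward()
print(tok.weight.grad[0].abs().sum().item())        # 0.0
```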

Why not a Transformer? For a 10-word vocabulary and 4-token captions, a Transformer would have nothing to do. Mean-pooling captures the bag-of-tokens information that the captions provide. Real CLIP uses a 12-layer Transformer text encoder ($\sim 60$M parameters); on natural language with variable structure and dependencies, the Transformer adds real value. At our scale, it doesn't.

The principle: the architecture has to be at least capable of representing the necessary distinctions in the dataset. For 16 classes built from 8 content tokens (four colors, four shapes), mean-pool is enough.

L2 Normalization — The First Critical Detail

img_emb = F.normalize(self.image_encoder(images), dim=-1)
txt_emb = F.normalize(self.text_encoder(captions), dim=-1)

Both encoders' outputs are L2-normalized to unit length before computing similarities. This is essential, not cosmetic. Three reasons:

(1) Dot product becomes cosine similarity. For unit vectors, $u \cdot v = \|u\| \|v\| \cos \theta = \cos \theta$. The similarity is bounded to $[-1, 1]$ regardless of the raw encoder output magnitudes. This is exactly what we want: a scale-free measure of how similar two embeddings are.

(2) It removes the magnitude degree of freedom. Without normalization, the model could lower the contrastive loss by inflating embedding norms, which scales up every logit at once, without improving the angles between matching pairs at all. Normalization forces the model to learn directions, not scales.

(3) It bounds the temperature search. The similarities are guaranteed to lie in $[-1, 1]$, which means the temperature $\tau$ can be reasoned about: $1/\tau = 100$ gives a maximum logit of $100$, well within softmax's usable range. Without normalization, the temperature would have to absorb arbitrary encoder magnitudes — making it both unstable and impossible to tune.
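The first two points can be seen on a pair of 2-D vectors. This is a plain-Python sketch, not the article's code path (which uses F.normalize); `l2_normalize` and `dot` are helper names introduced here, and the vectors are made up.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


u = [3.0, 4.0]
v = [4.0, 3.0]

cos_uv = dot(l2_normalize(u), l2_normalize(v))
print(round(cos_uv, 4))                      # 0.96

# Rescaling a raw embedding changes the raw dot product but not the cosine.
u_big = [1000.0 * x for x in u]
print(dot(u, v), dot(u_big, v))              # 24.0 24000.0
print(round(dot(l2_normalize(u_big), l2_normalize(v)), 4))   # 0.96
```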

The CLIP Wrapper

class CLIP(nn.Module):
    def __init__(self, vocab_size, max_len, embed_dim=64):
        super().__init__()
        self.image_encoder = ImageEncoder(embed_dim=embed_dim)
        self.text_encoder  = TextEncoder(vocab_size, max_len, embed_dim=embed_dim)
        self.logit_scale = nn.Parameter(torch.tensor(math.log(1.0 / 0.07)))

    def forward(self, images, captions):
        img_emb = F.normalize(self.image_encoder(images), dim=-1)
        txt_emb = F.normalize(self.text_encoder(captions), dim=-1)
        scale = self.logit_scale.clamp(0, math.log(100.0)).exp()
        logits = scale * img_emb @ txt_emb.T
        return logits, img_emb, txt_emb

The Learnable Temperature

The line self.logit_scale = nn.Parameter(torch.tensor(math.log(1.0 / 0.07))) deserves attention. CLIP makes $\tau$ a learnable scalar — parameterised in log-space, initialised at $\log(1/0.07) \approx 2.66$, and clamped to keep $1/\tau \leq 100$.

Why log-space? Because $\tau > 0$. If we parameterised $\tau$ directly and the optimizer ever pushed it negative, you would get a sign-flipped softmax with bizarre dynamics. By storing $\log(1/\tau)$ instead, the inverse temperature $1/\tau = \exp(\text{logit\_scale})$ is guaranteed positive, and therefore so is $\tau$, whatever value the optimizer moves the parameter to.

Why clamp $1/\tau \leq 100$? Because a runaway logit scale destroys training. Imagine the optimizer pushes $1/\tau$ to $10^{10}$. The softmax becomes a one-hot distribution: all probability on the maximum entry, zero everywhere else. Gradients through the softmax vanish for the non-maximum entries, the model can no longer distinguish "almost matches" from "doesn't match at all", and training collapses. The clamp prevents this failure mode at the cost of capping the maximum softmax sharpness.
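The numbers behind the log-space parameter and its clamp, as a plain-Python sketch of the same arithmetic the CLIP wrapper applies to logit_scale. `effective_scale` is a helper name introduced here. Note that the lower clamp bound of $0$ in the wrapper also keeps $1/\tau \geq 1$.

```python
import math

init_tau = 0.07
logit_scale = math.log(1.0 / init_tau)       # the stored parameter
print(round(logit_scale, 3))                 # 2.659

max_log = math.log(100.0)                    # upper clamp bound
print(round(max_log, 3))                     # 4.605


def effective_scale(param):
    """Clamp the log-space parameter to [0, log 100], then exponentiate."""
    clamped = min(max(param, 0.0), max_log)
    return math.exp(clamped)


# Whatever value the optimizer reaches, the scale stays in [1, 100].
for param in (-5.0, logit_scale, 10.0):
    print(round(effective_scale(param), 2))  # 1.0, then 14.29, then 100.0
```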

What does $\tau$ control? The softmax temperature determines how peaked the matching distribution is. Low $\tau$ (= large $1/\tau$) makes the softmax sharp: only the single best match gets significant probability, gradients focus narrowly. High $\tau$ flattens the distribution: many candidates retain non-trivial probability, gradients spread across them.
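To see the sharpening effect, here is a sketch that applies three inverse temperatures to the same similarities. The similarity values are made up for illustration; the three scales are $1$, the initial $1/0.07$, and the clamp value $100$.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Cosine similarities of one image against four captions (made-up values).
sims = [0.9, 0.7, 0.1, -0.3]

top_probs = []
for inv_tau in (1.0, 1.0 / 0.07, 100.0):
    probs = softmax([inv_tau * s for s in sims])
    top_probs.append(probs[0])
    print(round(inv_tau, 1), [round(p, 3) for p in probs])
```

The probability assigned to the best match grows with $1/\tau$, and at $1/\tau = 100$ it is essentially $1$ even though the runner-up similarity is only $0.2$ lower.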

By making $\tau$ learnable, CLIP lets the model adapt its own sharpness during training. Early on, when embeddings are uninformative, a flat distribution is appropriate — many candidates are plausibly correct. Late in training, when embeddings are sharp, a peaked distribution is appropriate — the matching pair should dominate. The learnable $\tau$ adapts to this naturally; in practice it approaches the clamp value during late training.

The InfoNCE Loss

def info_nce_loss(logits):
    B = logits.size(0)
    targets = torch.arange(B, device=logits.device)
    loss_i = F.cross_entropy(logits,   targets)
    loss_t = F.cross_entropy(logits.T, targets)
    return (loss_i + loss_t) / 2

Five lines. The negatives are automatic — they are the other examples in the same batch.

The targets are torch.arange(B). For each row $i$ of the $B \times B$ logit matrix, the correct match is column $i$. So the targets are simply $[0, 1, 2, \ldots, B-1]$. PyTorch's cross_entropy then computes the standard softmax cross-entropy with these targets.

Why two losses, not one? One direction is "given image $i$, which caption matches?" — that's F.cross_entropy(logits, targets). The other direction is "given caption $i$, which image matches?" — that's F.cross_entropy(logits.T, targets). CLIP averages both. This is the "symmetric" in symmetric InfoNCE.

Empirically, the symmetric version converges noticeably faster than either direction alone. The reason is structural: training only image-to-text produces an embedding space optimised for that direction; the reverse direction may not be well-conditioned. Training both keeps the embedding space symmetric.
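For readers who want to see what the two cross_entropy calls compute, here is a dependency-free sketch of the same symmetric loss on nested lists. The function names and the two toy logit matrices are introduced here for illustration.

```python
import math


def cross_entropy_rows(logits):
    """Mean over rows i of -log softmax(logits[i])[i] (targets on the diagonal)."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_info_nce(logits):
    """Average of image-to-text (rows) and text-to-image (columns) losses."""
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_rows(logits) + cross_entropy_rows(transposed))


B = 4
uninformative = [[0.0] * B for _ in range(B)]
aligned = [[10.0 if i == j else 0.0 for j in range(B)] for i in range(B)]

print(round(symmetric_info_nce(uninformative), 4))   # 1.3863, i.e. log(4)
print(symmetric_info_nce(aligned) < 0.001)           # True
```

An uninformative logit matrix gives the chance-level loss $\log B$, which is a useful reference value when reading training curves.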

Training Loop

opt = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.05)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)

for epoch in range(epochs):
    for images, captions in dataloader:
        logits, _, _ = model(images, captions)
        loss = info_nce_loss(logits)
        opt.zero_grad(set_to_none=True)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        opt.step()
    sched.step()

Standard AdamW + cosine schedule. The learning rate $\eta = 5 \times 10^{-4}$ matches the value the CLIP paper reports for its ViT-B/32 and ResNet-50 models; the weight decay of $0.05$ is lighter than the paper's $0.2$. Gradient clipping at $1.0$ guards against runaway updates to the temperature when the logit scale parameter receives large gradients.
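For intuition about the schedule, the closed-form cosine annealing curve can be evaluated directly. This sketch assumes the scheduler's default minimum learning rate of $0$ and uses the $30$ epochs quoted for Part 3 as T_max; `cosine_lr` is a helper name introduced here.

```python
import math

base_lr = 5e-4
epochs = 30      # T_max


def cosine_lr(epoch, base_lr=base_lr, t_max=epochs, eta_min=0.0):
    """Closed-form cosine annealing from base_lr down to eta_min."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))


for epoch in (0, 15, 30):
    print(epoch, f"{cosine_lr(epoch):.2e}")
# 0 5.00e-04
# 15 2.50e-04
# 30 0.00e+00
```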

The batch size matters more than usual. Recall: in each batch of $B$ pairs, each image is contrasted against $B - 1$ negative captions. Larger $B$ = more negatives per positive = a harder contrastive problem = sharper learned embeddings. We use $B = 64$ here. OpenAI's CLIP trained at $B = 32{,}768$. That is not a typo; that is genuinely the most important hyperparameter at scale.

Zero-Shot Evaluation

@torch.no_grad()
def zero_shot_accuracy(model, X, Y, class_captions):
    model.eval()
    img_emb = F.normalize(model.image_encoder(X), dim=-1)
    txt_emb = F.normalize(model.text_encoder(class_captions), dim=-1)
    preds = (img_emb @ txt_emb.T).argmax(dim=1)
    return (preds == Y).float().mean().item()

Encode all $16$ class captions ("a red circle", "a red square", ..., "a yellow cross") once. Encode each test image. Compute the $16$-way similarity for each image. Argmax to predict the class. Compare to ground truth.
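The $16$ class captions can be enumerated from the two attribute lists. A sketch, assuming class indices follow color-major order (the actual ordering in dataset.py may differ):

```python
from itertools import product

COLORS = ["red", "blue", "green", "yellow"]
SHAPES = ["circle", "square", "triangle", "cross"]

# Class index c corresponds to class_captions[c]; the ordering is an assumption.
class_captions = [f"a {color} {shape}" for color, shape in product(COLORS, SHAPES)]

print(len(class_captions))      # 16
print(class_captions[0])        # a red circle
print(class_captions[1])        # a red square
print(class_captions[-1])       # a yellow cross
```

These strings still have to be tokenised into id tensors before they are passed to zero_shot_accuracy.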

This is the "zero-shot" classification pattern, and it deserves slowing down on. We never trained a classifier head. No softmax over class indices, no fitted linear layer mapping image embeddings to $16$ outputs. The classification capability emerges from the contrastive training: the model learned to put image embeddings near their matching caption embeddings, and "classifying" a new image is just looking up which class caption is closest.

The classifier is the text encoder. To add a new class, you don't retrain — you encode the new class caption and add it to the similarity computation. To swap to a different classification task, you swap the class captions. The same trained CLIP model can do dog/cat classification, indoor/outdoor classification, or any other taxonomy you express in natural language, with no parameter updates.
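The "classifier is a lookup" idea, reduced to a toy. The 2-D vectors below are made-up stand-ins for normalized encoder outputs, and `nearest_caption` is a helper name introduced here; nothing in this sketch touches model weights.

```python
def nearest_caption(image_emb, caption_embs):
    """Return the caption whose (unit-norm) embedding has the largest dot product."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    return max(caption_embs, key=lambda cap: dot(image_emb, caption_embs[cap]))


# Made-up 2-D unit vectors standing in for text-encoder outputs.
caption_embs = {
    "a red circle": [1.0, 0.0],
    "a blue square": [0.0, 1.0],
}
image_emb = [0.6, 0.8]   # stand-in for a normalized image embedding

print(nearest_caption(image_emb, caption_embs))   # a blue square

# Adding a class is just adding one more caption embedding; nothing is retrained.
caption_embs["a green cross"] = [0.6, 0.8]
print(nearest_caption(image_emb, caption_embs))   # a green cross
```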

This generality is the headline claim of the CLIP paper. The model learns one embedding space; downstream tasks select directions in it via natural-language prompts.

What This Implementation Skips

Real CLIP implementations add several things we leave out for clarity:

Larger encoders. ViT-B/32 or ResNet-50 for images; 12-layer Transformer for text. Our toy uses a small CNN and mean-pool because the dataset is small.

Mixed-precision training. Real CLIP trains in bf16/fp16 with selective fp32 accumulation around the InfoNCE loss for numerical stability at large batch sizes. We stay in fp32, PyTorch's default precision.

Distributed-batch InfoNCE. At $B = 32{,}768$ across many GPUs, the InfoNCE loss requires all-gathering embeddings across the cluster so each GPU's loss sees the full batch's negatives. This is non-trivial engineering. We run on a single device.

Tokenization. Real CLIP uses BPE for text (a $49{,}152$-token vocabulary). We use a fixed $10$-word vocabulary.

What survives at our scale: the contrastive recipe itself works at any data scale. What doesn't survive: the impressive zero-shot generalisation that emerges only at the scale of WIT-400M.

What Part 3 Tests

With CLIP in hand, Part 3 trains the model for $30$ epochs on $2{,}720$ image-caption pairs and evaluates zero-shot accuracy on the $480$ held-out images. The result: $100\%$ zero-shot accuracy on 16-way classification in $14$ seconds. The t-SNE projection of the learned embedding space reveals $16$ tight clusters, one per (color, shape) combination, emerging from the contrastive objective alone, without ever being told that "color" and "shape" are separable attributes.

Full code on GitHub: github.com/soveshmohapatra/CLIP