Deconstructing CLIP

Contrastive image-text pretraining in 80 lines of PyTorch: two encoders, a shared embedding space, a single InfoNCE loss, and 100% zero-shot accuracy on 16-way classification.

Part 1

Contrastive Image-Text Pairs

InfoNCE in detail, the learnable temperature, why batch size is the most important hyperparameter, and how 'zero-shot' classification actually works.
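
As a rough preview of the loss this part covers, here is a minimal sketch of symmetric InfoNCE in plain Python (standard library only, not the project's PyTorch code). It assumes image i is paired with text i in the batch, and that logits are cosine similarities divided by a temperature. The function names and the default temperature of 0.07 are illustrative assumptions; in training the temperature would be a learnable parameter rather than a fixed constant.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def symmetric_info_nce(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE for a batch where image i matches text i.

    temperature is an assumed fixed value here; a real model would learn it.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = cosine similarity of image i and text j, scaled by 1/temperature
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Image-to-text: each row is an n-way classification over the captions.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image: each column is an n-way classification over the images.
    columns = [list(col) for col in zip(*logits)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2
```

When every pair in the batch is equally similar, this loss equals log N for batch size N. That is one way to see why batch size matters: each positive pair has to beat N - 1 negatives, so a larger batch makes the task harder and the training signal richer.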

Part 2

Eighty Lines of PyTorch

ImageEncoder (CNN), TextEncoder, the full CLIP wrapper, and the symmetric InfoNCE loss.

Part 3

100% Zero-Shot in 14s

A 75K-parameter CLIP trained on 2,720 colored-shape pairs. Sixteen tight (color, shape) clusters emerge in the embedding space without supervision on either concept.
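
The zero-shot evaluation can be pictured as nearest-prompt classification: embed one text prompt per class, embed the image, and pick the class whose text embedding is closest. The sketch below is plain Python with hand-made toy vectors. The function names, the two labels, and the 2-D embeddings are hypothetical stand-ins for the outputs of the trained encoders over all 16 (color, shape) classes.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_emb, class_text_embs):
    """Return the label whose text embedding is most similar to the image embedding."""
    return max(class_text_embs, key=lambda label: cosine(image_emb, class_text_embs[label]))

# Hypothetical toy embeddings; a trained model would produce these from its encoders.
class_text_embs = {
    "red circle": [1.0, 0.0],
    "blue square": [0.0, 1.0],
}
prediction = zero_shot_classify([0.9, 0.1], class_text_embs)
```

No classifier head is trained here: the class set is defined entirely by which prompts are embedded, which is what makes the evaluation 'zero-shot'.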