Deconstructing CLIP
Contrastive image-text pretraining in 80 lines of PyTorch: two encoders, a shared embedding space, a single InfoNCE loss, and 100% zero-shot accuracy on 16-way classification.
Part 1
Contrastive Image-Text Pairs
InfoNCE in detail, the learnable temperature, why batch size is the most important hyperparameter, and how 'zero-shot' classification actually works.
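As a rough illustration of the loss this part covers, here is a minimal, framework-free sketch of symmetric InfoNCE. It is not the series' PyTorch code: the function names are hypothetical, and the temperature is a fixed argument (with an assumed default of 0.07) rather than a learned parameter. Each image in a batch of N is scored against all N texts, so every pair sees N-1 negatives, which hints at why batch size matters so much.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def symmetric_info_nce(image_embs, text_embs, temperature=0.07):
    """Average of the image-to-text and text-to-image InfoNCE losses.

    Assumes image_embs[i] and text_embs[i] form a matching pair.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # logits[i][j] = cosine(image i, text j) / temperature
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))
```

A lower temperature sharpens the softmax over the batch, so well-aligned pairs drive the loss toward zero while misaligned pairs are penalized heavily. In a PyTorch version the same computation would typically be a matrix product followed by two cross-entropy calls.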
Part 2
Eighty Lines of PyTorch
ImageEncoder (CNN), TextEncoder, full CLIP wrapper, symmetric InfoNCE loss.
View Code on GitHub
Part 3
100% Zero-Shot in 14s
A 75K-parameter CLIP on 2,720 colored-shape pairs. Sixteen tight (color, shape) clusters emerge in the embedding space without supervision on either concept.
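For readers wondering what the zero-shot step amounts to, the sketch below shows the general idea under stated assumptions: embed one text prompt per class, embed the image, and pick the class whose prompt is most similar. The labels and three-dimensional vectors are hypothetical toy stand-ins, not outputs of the trained model, and `zero_shot_classify` is an illustrative name rather than a function from the series' code.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_emb, prompt_embs):
    """Return the label whose prompt embedding is closest to the image embedding."""
    return max(prompt_embs, key=lambda label: cosine(image_emb, prompt_embs[label]))

# Hypothetical embeddings; in practice these would come from the text encoder,
# one prompt per (color, shape) class.
prompt_embs = {
    "red circle": [1.0, 0.0, 0.0],
    "red square": [0.0, 1.0, 0.0],
    "blue circle": [0.0, 0.0, 1.0],
}

# Hypothetical image embedding; in practice this would come from the image encoder.
image_emb = [0.1, 0.9, 0.2]
predicted = zero_shot_classify(image_emb, prompt_embs)
```

No classifier head is trained here: the class set is defined entirely by the prompts, so adding a class only requires embedding one more piece of text.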