Deconstructing DiT
The Diffusion Transformer (Peebles & Xie, 2023) is the architecture family behind Stable Diffusion 3 and Sora. Built from scratch in 250 lines of PyTorch with adaLN-Zero conditioning.
Part 1
Diffusion Transformers and adaLN-Zero
The architecture, how adaLN-Zero conditioning replaces cross-attention, why zero-init makes early training stable, and the inductive-bias tradeoff with U-Net.
Part 2
250 Lines of PyTorch
Noise scheduler, PatchEmbed, DiTBlock with adaLN-Zero, full DiT model, sampling.
View Code on GitHub
Part 3
Inductive Bias as a Resource
Class conditioning works; shape geometry needs scale. U-Nets win at small data; DiT wins at LAION scale. The crossover is the whole story.