Deconstructing DiT

The Diffusion Transformer (Peebles & Xie, 2023) is the architecture family behind Stable Diffusion 3 and Sora. This project builds it from scratch in 250 lines with adaLN-Zero conditioning.

Part 1

Diffusion Transformers and adaLN-Zero

The architecture, how adaLN-Zero conditioning replaces cross-attention, why zero-initialization keeps early training stable, and the inductive-bias tradeoff with U-Net.
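
As a rough illustration of the adaLN-Zero idea, here is a minimal PyTorch sketch of one block. It is not taken from the project's code: the class name, layer sizes, and the use of `nn.MultiheadAttention` are assumptions made for the example. The conditioning vector `c` (for instance, a timestep embedding plus a class embedding) is mapped to shift, scale, and gate values, and the final linear layer of that mapping starts at zero, so every block begins as the identity function.

```python
import torch
import torch.nn as nn


def modulate(x, shift, scale):
    # x: (B, N, D); shift and scale: (B, D), broadcast over the token axis.
    return x * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)


class DiTBlock(nn.Module):
    """Transformer block with adaLN-Zero conditioning (illustrative sketch)."""

    def __init__(self, dim, num_heads, mlp_ratio=4.0):
        super().__init__()
        # LayerNorms carry no learned affine; scale and shift come from c.
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False, eps=1e-6)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False, eps=1e-6)
        hidden = int(dim * mlp_ratio)
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
        )
        # Regress 6 modulation vectors per block from the conditioning.
        self.adaLN = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))
        # The "Zero" part: gates (and shifts/scales) start at exactly zero.
        nn.init.zeros_(self.adaLN[1].weight)
        nn.init.zeros_(self.adaLN[1].bias)

    def forward(self, x, c):
        shift1, scale1, gate1, shift2, scale2, gate2 = self.adaLN(c).chunk(6, dim=-1)
        h = modulate(self.norm1(x), shift1, scale1)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + gate1.unsqueeze(1) * attn_out
        h = modulate(self.norm2(x), shift2, scale2)
        x = x + gate2.unsqueeze(1) * self.mlp(h)
        return x
```

Because both residual branches are multiplied by a gate that is zero at initialization, a freshly constructed block returns its input unchanged, which is the property the zero-init stability argument relies on.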

Part 2

250 Lines of PyTorch

Noise scheduler, PatchEmbed, DiTBlock with adaLN-Zero, full DiT model, sampling.
View Code on GitHub
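
To give a feel for the first two components in that list, here is a small sketch of a noise scheduler and the patch-splitting step. It assumes a DDPM-style linear beta schedule and a reshape-based patchify; the page does not say which schedule or patch-embedding method the repository uses, so the names `NoiseScheduler`, `add_noise`, and `patchify` and all default values are hypothetical.

```python
import torch


class NoiseScheduler:
    """Forward diffusion with a linear beta schedule (assumed, DDPM-style)."""

    def __init__(self, num_steps=1000, beta_start=1e-4, beta_end=0.02):
        self.betas = torch.linspace(beta_start, beta_end, num_steps)
        # Cumulative product of (1 - beta_t): fraction of signal kept at step t.
        self.alpha_bars = torch.cumprod(1.0 - self.betas, dim=0)

    def add_noise(self, x0, t, noise):
        # x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise
        a = self.alpha_bars[t].sqrt().view(-1, 1, 1, 1)
        s = (1.0 - self.alpha_bars[t]).sqrt().view(-1, 1, 1, 1)
        return a * x0 + s * noise


def patchify(x, patch_size):
    """Split (B, C, H, W) images into (B, num_patches, C * p * p) tokens."""
    B, C, H, W = x.shape
    p = patch_size
    x = x.reshape(B, C, H // p, p, W // p, p)
    x = x.permute(0, 2, 4, 1, 3, 5)  # (B, H/p, W/p, C, p, p)
    return x.reshape(B, (H // p) * (W // p), C * p * p)
```

In a full model the flattened patches would then pass through a learned linear projection and receive positional embeddings before entering the transformer blocks.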

Part 3

Inductive Bias as a Resource

Class conditioning works; shape geometry needs scale. U-Nets win at small data; DiT wins at LAION scale. The crossover is the whole story.