For the first few years after DDPM (2020), 'diffusion model' in practice meant a U-Net. Peebles & Xie (2023) replaced it with a Vision Transformer, the Diffusion Transformer (DiT). By 2024 many prominent diffusion models, including SD3, Sora and PixArt-α, had moved to DiT-style backbones.
The U-Net-to-Transformer Transition
DDPM (2020), GLIDE, Latent Diffusion, Stable Diffusion 1/2 and Imagen all use a convolutional U-Net, and the text-conditional ones inject text mainly through cross-attention. DiT replaces the U-Net entirely with a Vision Transformer that processes the (latent) image as a sequence of patch tokens and injects all conditioning through adaLN-Zero.
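To make "a sequence of patches" concrete, here is a minimal NumPy sketch of the patchify step and its inverse. The function names `patchify` and `unpatchify`, the 4×32×32 latent and the patch size of 2 are illustrative assumptions, and the learned linear projection to the hidden width and the positional embeddings are left out.

```python
import numpy as np

def patchify(latent: np.ndarray, patch: int) -> np.ndarray:
    """Split a (C, H, W) array into a (num_patches, patch*patch*C) token sequence."""
    c, h, w = latent.shape
    if h % patch or w % patch:
        raise ValueError("H and W must be divisible by the patch size")
    gh, gw = h // patch, w // patch
    # (C, gh, p, gw, p) -> (gh, gw, p, p, C) -> (gh*gw, p*p*C)
    x = latent.reshape(c, gh, patch, gw, patch)
    x = x.transpose(1, 3, 2, 4, 0)
    return x.reshape(gh * gw, patch * patch * c)

def unpatchify(tokens: np.ndarray, patch: int, channels: int,
               height: int, width: int) -> np.ndarray:
    """Inverse of patchify: rebuild the (C, H, W) array from the token sequence."""
    gh, gw = height // patch, width // patch
    x = tokens.reshape(gh, gw, patch, patch, channels)
    x = x.transpose(4, 0, 2, 1, 3)  # back to (C, gh, p, gw, p)
    return x.reshape(channels, height, width)

# Assumed example sizes: a 4-channel 32x32 latent cut into 2x2 patches.
latent = np.arange(4 * 32 * 32, dtype=np.float32).reshape(4, 32, 32)
tokens = patchify(latent, patch=2)
restored = unpatchify(tokens, patch=2, channels=4, height=32, width=32)
```

Each row of `tokens` is one patch in raster order. The transformer only ever sees this flat sequence, which is why any notion of 2D neighbourhood has to come from positional embeddings and training.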
The Architecture
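Each DiT block applies a modulated self-attention sub-layer followed by a modulated MLP sub-layer, each inside a gated residual (written here with the $(1 + \gamma)$ scale convention):
$$\mathbf{h} = \mathbf{x} + \alpha_a \odot \text{Attn}\big((1 + \gamma_a) \odot \text{LN}(\mathbf{x}) + \beta_a\big)$$
$$\mathbf{x}' = \mathbf{h} + \alpha_m \odot \text{MLP}\big((1 + \gamma_m) \odot \text{LN}(\mathbf{h}) + \beta_m\big)$$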
The six modulation parameters $(\gamma_a, \beta_a, \alpha_a, \gamma_m, \beta_m, \alpha_m)$, one scale, shift and gate each for the attention ($a$) and MLP ($m$) sub-layers, are produced by a small MLP from the conditioning vector $\mathbf{c} = \text{embed}(t) + \text{embed}(y)$. LayerNorm has its own learned affine disabled, so scale and shift depend entirely on the conditioning.
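The block can be sketched in a few lines of NumPy. This is a toy under stated assumptions, not the reference implementation: it uses single-head attention, a SiLU MLP, the $(1 + \gamma)$ scale convention, and random weights throughout; `ToyDiTBlock` and all of its attribute names are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """LayerNorm over the last axis with no learned scale or shift."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def silu(x):
    return x / (1.0 + np.exp(-x))

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

class ToyDiTBlock:
    """One modulated transformer block (hypothetical toy, single attention head)."""

    def __init__(self, dim, rng):
        s = dim ** -0.5
        self.w_qkv = rng.normal(0, s, (dim, 3 * dim))
        self.w_out = rng.normal(0, s, (dim, dim))
        self.w_fc1 = rng.normal(0, s, (dim, 4 * dim))
        self.w_fc2 = rng.normal(0, (4 * dim) ** -0.5, (4 * dim, dim))
        # Modulation MLP: SiLU followed by one linear layer, dim -> 6*dim.
        # Random here so the block visibly does something.
        self.w_mod = rng.normal(0, s, (dim, 6 * dim))
        self.b_mod = np.zeros(6 * dim)

    def attention(self, x):
        q, k, v = np.split(x @ self.w_qkv, 3, axis=-1)
        weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
        return (weights @ v) @ self.w_out

    def mlp(self, x):
        return silu(x @ self.w_fc1) @ self.w_fc2

    def __call__(self, x, c):
        # x: (tokens, dim) patch tokens; c: (dim,) conditioning vector.
        mod = silu(c) @ self.w_mod + self.b_mod
        g_a, b_a, a_a, g_m, b_m, a_m = np.split(mod, 6)
        h = x + a_a * self.attention((1 + g_a) * layer_norm(x) + b_a)
        return h + a_m * self.mlp((1 + g_m) * layer_norm(h) + b_m)

rng = np.random.default_rng(0)
block = ToyDiTBlock(dim=16, rng=rng)
x = rng.normal(size=(8, 16))      # 8 tokens of width 16 (assumed sizes)
c1 = rng.normal(size=16)          # stands in for embed(t) + embed(y)
c2 = rng.normal(size=16)
y1, y2 = block(x, c1), block(x, c2)
```

The same tokens `x` yield different outputs under `c1` and `c2`, even though the conditioning never enters the token sequence. It acts only through the six per-channel vectors, which broadcast across all tokens.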
Why adaLN-Zero?
adaLN: LayerNorm's scale and shift come from the conditioning vector rather than being learned per layer. Gating: a gate $\alpha$, also regressed from $\mathbf{c}$, scales each residual branch. Zero init: the final layer of the modulation MLP is initialised to zero, so every gate starts at $\alpha = 0$ and every block starts as the identity. Training then gradually moves the modulation away from zero as each block finds its role. Empirically this matters: in the DiT paper's ablations, adaLN-Zero reached lower FID than plain adaLN.
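The identity-at-initialisation property is easy to check numerically. The sketch below uses a single gated residual with a stand-in `tanh` sub-layer. All names (`gated_residual`, `w_sub`, `w_mod`) and the squared-error loss are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8

def layer_norm(x, eps=1e-6):
    """LayerNorm over the last axis with no learned scale or shift."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

w_sub = rng.normal(size=(dim, dim))  # stand-in for attention or MLP weights

def sublayer(x):
    return np.tanh(x @ w_sub)

def gated_residual(x, c, w_mod):
    """x + alpha * f((1 + gamma) * LN(x) + beta), with (gamma, beta, alpha) = split(c @ w_mod)."""
    gamma, beta, alpha = np.split(c @ w_mod, 3)
    return x + alpha * sublayer((1 + gamma) * layer_norm(x) + beta)

x = rng.normal(size=(4, dim))
c = rng.normal(size=dim)
target = rng.normal(size=(4, dim))

# Zero-initialised modulation: gamma = beta = alpha = 0, so the branch is switched off.
w_zero = np.zeros((dim, 3 * dim))
out_at_init = gated_residual(x, c, w_zero)
is_identity = np.allclose(out_at_init, x)

# Gradient of 0.5 * ||out - target||^2 with respect to alpha, evaluated at init.
# The gradients for gamma and beta carry a factor of alpha and are exactly zero here,
# so at the first step only the gate receives a learning signal.
branch = sublayer(layer_norm(x))
grad_alpha = ((out_at_init - target) * branch).sum(axis=0)

# For contrast: a randomly initialised modulation layer perturbs x from the start.
w_rand = rng.normal(size=(dim, 3 * dim))
out_random = gated_residual(x, c, w_rand)
```

In this toy, `grad_alpha` is nonzero even though the block is an exact identity. That is the mechanism by which the blocks "un-zero": the gate opens first, and the scale and shift start receiving gradient only once it has.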
Why Not Cross-Attention?
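Cross-attention adds projections and attention computation over every image token at every layer; the DiT paper reports roughly 15% extra Gflops for it. adaLN-Zero needs only one small modulation MLP per block, evaluated once per sample rather than once per token, so its compute overhead is negligible. For sequence-independent conditioning (a single timestep + class vector), modulation is sufficient. Text is a sequence, so text-conditional DiT variants reintroduce token-level mixing for the text branch: PixArt-α adds cross-attention to each block, and SD3 concatenates text and image tokens in a joint attention. Both keep adaLN-style modulation for the timestep. Sora's architecture has not been published in comparable detail.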
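A back-of-the-envelope count shows where the saving comes from. The formulas below are a rough sketch under stated assumptions: one `Linear(d, 6d)` for the modulation, four `d × d` projections for cross-attention, biases included, softmax and normalisation costs ignored. The sizes (`d = 1152`, 256 image tokens, 2 context tokens) are illustrative.

```python
def adaln_cost(d: int) -> tuple[int, int]:
    """(parameters, multiply-accumulates per sample) for one Linear(d -> 6d) modulation layer."""
    params = d * 6 * d + 6 * d
    macs = d * 6 * d  # evaluated once per sample, independent of token count
    return params, macs

def cross_attn_cost(d: int, n_tokens: int, n_ctx: int) -> tuple[int, int]:
    """(parameters, multiply-accumulates per sample) for one cross-attention layer."""
    params = 4 * (d * d + d)              # Q, K, V and output projections
    macs = (2 * n_tokens * d * d          # Q and output projections on image tokens
            + 2 * n_ctx * d * d           # K and V projections on context tokens
            + 2 * n_tokens * n_ctx * d)   # attention scores and weighted sum
    return params, macs

# Assumed sizes for illustration only.
d, n_tokens, n_ctx = 1152, 256, 2
adaln_params, adaln_macs = adaln_cost(d)
cross_params, cross_macs = cross_attn_cost(d, n_tokens, n_ctx)
mac_ratio = cross_macs / adaln_macs

print(f"adaLN:      {adaln_params:,} params, {adaln_macs:,} MACs")
print(f"cross-attn: {cross_params:,} params, {cross_macs:,} MACs")
print(f"cross-attn / adaLN MACs: {mac_ratio:.1f}x")
```

Under these assumptions the modulation layer is not smaller in parameter count (`6d²` against `4d²`). Its advantage is compute: it runs once per sample, while cross-attention projections run once per image token.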
Inductive Bias as a Resource
U-Nets bake in three priors: locality, translation equivariance and multi-scale processing. These are well matched to natural images and let a U-Net learn good image structure from comparatively little data.
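Translation equivariance, for instance, is a structural property of convolution that can be checked directly. The sketch below uses a 1D circular convolution and contrasts it with an unconstrained dense mixing matrix, a loose stand-in for a layer with no built-in spatial prior. The helper `circular_conv1d` and all sizes are illustrative assumptions.

```python
import numpy as np

def circular_conv1d(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Circular 1D convolution: the same small kernel is applied at every position."""
    k = len(kernel)
    return sum(kernel[i] * np.roll(x, i - k // 2) for i in range(k))

rng = np.random.default_rng(0)
x = rng.normal(size=32)
kernel = rng.normal(size=3)
shift = 5

# Convolution: shifting the input and then convolving equals convolving and then shifting.
conv_then_shift = np.roll(circular_conv1d(x, kernel), shift)
shift_then_conv = circular_conv1d(np.roll(x, shift), kernel)
conv_is_equivariant = np.allclose(conv_then_shift, shift_then_conv)

# Dense mixing with arbitrary weights: nothing forces the two orders to agree.
w_dense = rng.normal(size=(32, 32))
dense_then_shift = np.roll(w_dense @ x, shift)
shift_then_dense = w_dense @ np.roll(x, shift)
dense_is_equivariant = np.allclose(dense_then_shift, shift_then_dense)
```

The convolution gets equivariance for free from weight sharing, with 3 parameters instead of 1024. The dense layer could represent the same map, but only if training drives its weights there.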
DiT has almost none of these: beyond the patch embedding and positional embeddings, it must learn them from data. With small datasets this is a disadvantage. With large datasets it is an advantage: the model is free to learn whatever bias the data actually supports, including ones a U-Net could not represent. The crossover is empirical and depends on resolution and compute; as a rough rule of thumb it sits somewhere around several million images at modern resolutions.