Deconstructing DiT: Part 1 — Diffusion Transformers and adaLN-Zero

Overview

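For several years after DDPM (2020), 'diffusion model' in practice meant a U-Net backbone. Peebles & Xie (2023) replaced it with a Vision Transformer, the Diffusion Transformer (DiT). By 2024 several prominent models, including SD3, Sora, and PixArt-α, had moved to DiT-style transformer backbones.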

The U-Net-to-Transformer Transition

DDPM (2020), GLIDE, Latent Diffusion, Stable Diffusion 1/2, and Imagen all use a convolutional U-Net backbone, and the text-conditional ones typically inject text through attention over text tokens. DiT replaces the U-Net entirely with a Vision Transformer that processes the image (in practice, a VAE latent) as a sequence of patches, and injects timestep and class conditioning via adaLN-Zero.
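To make "a sequence of patches" concrete, here is a minimal NumPy sketch of patchification. The function names `patchify` and `unpatchify`, the 4×32×32 latent shape, and the patch size of 2 are illustrative assumptions, not taken from the text above; a real model would also apply a learned linear projection and position embeddings to each token.

```python
import numpy as np

def patchify(x, p):
    """Split a (C, H, W) array into (H/p * W/p) tokens of length p*p*C."""
    c, h, w = x.shape
    assert h % p == 0 and w % p == 0, "patch size must divide height and width"
    x = x.reshape(c, h // p, p, w // p, p)
    x = x.transpose(1, 3, 2, 4, 0)          # (H/p, W/p, p, p, C)
    return x.reshape((h // p) * (w // p), p * p * c)

def unpatchify(tokens, p, c, h, w):
    """Inverse of patchify: rebuild the (C, H, W) array from the token sequence."""
    x = tokens.reshape(h // p, w // p, p, p, c)
    x = x.transpose(4, 0, 2, 1, 3)          # (C, H/p, p, W/p, p)
    return x.reshape(c, h, w)

# Hypothetical latent: 4 channels, 32x32 spatial, patch size 2 (assumed values).
latent = np.random.default_rng(0).normal(size=(4, 32, 32))
tokens = patchify(latent, p=2)
restored = unpatchify(tokens, p=2, c=4, h=32, w=32)
print(tokens.shape)
```

Halving the patch size quadruples the number of tokens, so patch size is the main knob trading compute against spatial detail.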

The Architecture

Each DiT block applies:

$$\mathbf{h} = \text{LayerNorm}(\mathbf{x}) \cdot (1 + \gamma_a) + \beta_a, \quad \mathbf{x} \leftarrow \mathbf{x} + \alpha_a \cdot \text{SelfAttention}(\mathbf{h}),$$

$$\mathbf{h} = \text{LayerNorm}(\mathbf{x}) \cdot (1 + \gamma_m) + \beta_m, \quad \mathbf{x} \leftarrow \mathbf{x} + \alpha_m \cdot \text{MLP}(\mathbf{h}).$$

The six modulation parameters $(\gamma_a, \beta_a, \alpha_a, \gamma_m, \beta_m, \alpha_m)$ are each a vector of the hidden width, produced by a small MLP from the conditioning vector $\mathbf{c} = \text{embed}(t) + \text{embed}(y)$ and broadcast across all tokens. LayerNorm has its own learned affine disabled, so scale and shift depend entirely on the conditioning.
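The modulation path can be sketched in a few lines of NumPy. Everything named here is an assumption for illustration: the toy sizes, the sinusoidal `timestep_embedding`, the random `class_table` standing in for a learned embedding, and the SiLU-then-linear form of the "small MLP". The chunk order follows the tuple in the text rather than any particular codebase.

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_classes = 16, 10                      # hypothetical toy sizes

def timestep_embedding(t, dim):
    # Sinusoidal features of the scalar timestep t (simplified embed(t)).
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    return np.concatenate([np.cos(t * freqs), np.sin(t * freqs)])

def silu(z):
    return z / (1.0 + np.exp(-z))

def layer_norm(x, eps=1e-6):
    # LayerNorm over the feature axis with no learned scale or shift.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

class_table = rng.normal(size=(num_classes, d))    # stand-in for embed(y)
w_mod = rng.normal(scale=0.02, size=(d, 6 * d))    # modulation projection
b_mod = np.zeros(6 * d)

def modulation(t, y):
    # c = embed(t) + embed(y), then one projection split into six d-vectors.
    c = timestep_embedding(t, d) + class_table[y]
    return np.split(silu(c) @ w_mod + b_mod, 6)

x = rng.normal(size=(4, d))                  # 4 tokens of width d
gamma_a, beta_a, alpha_a, gamma_m, beta_m, alpha_m = modulation(t=250, y=3)

# The same (d,) scale and shift are broadcast over every token.
h = layer_norm(x) * (1 + gamma_a) + beta_a
print(h.shape)
```

Because the modulation depends only on $(t, y)$, it is computed once per sample and reused for every token in the sequence.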

Why adaLN-Zero?

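adaLN-Zero combines three ideas. adaLN: LayerNorm's scale and shift are regressed from the conditioning vector rather than stored as fixed per-layer parameters. Gating: each residual branch is multiplied by $\alpha$, which is likewise regressed from $\mathbf{c}$. Zero init: the final layer of the modulation MLP is initialised to zero, so $\gamma = \beta = \alpha = 0$ at the start and every block is the identity. Training then moves the gates away from zero as the model works out what each block should do. In the DiT paper's ablations this zero initialisation clearly outperforms plain adaLN, which suggests the identity start matters for early training.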
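The identity-at-initialisation property is easy to check numerically. The sketch below is a toy single-head block in NumPy under stated assumptions: the sizes, the weight names in `params`, ReLU in place of GELU, and SiLU-then-linear as the modulation MLP are all illustrative choices rather than the reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_tokens = 8, 5                           # hypothetical toy sizes

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def silu(z):
    return z / (1.0 + np.exp(-z))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(h, w_qkv, w_o):
    # Single-head attention, enough to exercise the residual gate.
    q, k, v = np.split(h @ w_qkv, 3, axis=-1)
    return softmax(q @ k.T / np.sqrt(d)) @ v @ w_o

def mlp(h, w1, w2):
    return np.maximum(h @ w1, 0.0) @ w2      # ReLU stands in for GELU

def dit_block(x, c, p):
    g_a, b_a, a_a, g_m, b_m, a_m = np.split(silu(c) @ p["w_mod"] + p["b_mod"], 6)
    h = layer_norm(x) * (1 + g_a) + b_a
    x = x + a_a * self_attention(h, p["w_qkv"], p["w_o"])
    h = layer_norm(x) * (1 + g_m) + b_m
    return x + a_m * mlp(h, p["w1"], p["w2"])

params = {
    "w_qkv": rng.normal(scale=0.1, size=(d, 3 * d)),
    "w_o": rng.normal(scale=0.1, size=(d, d)),
    "w1": rng.normal(scale=0.1, size=(d, 4 * d)),
    "w2": rng.normal(scale=0.1, size=(4 * d, d)),
    "w_mod": np.zeros((d, 6 * d)),           # zero init: all six outputs are 0
    "b_mod": np.zeros(6 * d),
}

x = rng.normal(size=(n_tokens, d))
c = rng.normal(size=d)

out_at_init = dit_block(x, c, params)
print("identity at init:", np.allclose(out_at_init, x))

# Once the modulation weights move away from zero the block starts to act.
params["w_mod"] = rng.normal(scale=0.1, size=(d, 6 * d))
out_later = dit_block(x, c, params)
print("identity later:", np.allclose(out_later, x))
```

Note that the attention and MLP weights are random from the start; only the gates are zero, so the block is an identity without the sublayers themselves being degenerate.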

Why Not Cross-Attention?

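Cross-attention adds an extra attention sublayer to every block, with query and output projections applied to every image token. adaLN-Zero needs only one small projection per block, applied once per sample rather than once per token, so its compute overhead is negligible (its parameter count, roughly $6d^2$ per block for hidden width $d$, is not). For sequence-independent conditioning (a single time + class vector) modulation is sufficient. Text conditioning is sequence-dependent, so text-conditional DiT variants add token-level attention back: PixArt-α uses cross-attention to the text tokens, and SD3 runs joint attention over concatenated text and image tokens. Both keep adaLN-style modulation for the timestep.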

Inductive Bias as a Resource

U-Nets bake in three priors: locality, translation equivariance, and multi-scale processing. These suit natural images well and let a U-Net learn good image structure from comparatively little data.

DiT has almost none of these built in; beyond patchification and position embeddings, it must learn them from data. With small datasets this is a disadvantage. With large datasets it becomes an advantage: the model is free to learn whatever bias the data actually supports, including ones a U-Net could not represent. Where the crossover falls is an empirical question; as a rough estimate, it sits somewhere around several million images at modern resolutions.
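One of these priors, translation equivariance, can be checked directly. The NumPy sketch below is a toy illustration under stated assumptions: `circular_conv2d` uses wrap-around padding so the equivariance is exact (real U-Nets use zero padding and downsampling, which make it only approximate), and `patch_tokens` adds random position embeddings as a stand-in for a transformer's input layer. All names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def circular_conv2d(img, kernel):
    # 2D cross-correlation with wrap-around padding, the core op of a conv layer.
    kh, kw = kernel.shape
    out = np.zeros_like(img)
    for i in range(kh):
        for j in range(kw):
            shift = (kh // 2 - i, kw // 2 - j)
            out += kernel[i, j] * np.roll(img, shift, axis=(0, 1))
    return out

def patch_tokens(img, p, pos):
    # Non-overlapping p x p patches flattened to tokens, plus position embeddings.
    h, w = img.shape
    grid = img.reshape(h // p, p, w // p, p).transpose(0, 2, 1, 3)
    return grid.reshape(h // p, w // p, p * p) + pos

img = rng.normal(size=(8, 8))
kernel = rng.normal(size=(3, 3))
pos = rng.normal(size=(4, 4, 4))             # 4x4 token grid, token width 4 (p = 2)

shifted = np.roll(img, (2, 2), axis=(0, 1))  # shift the image by exactly one patch

# Convolution: shifting the input shifts the output by the same amount.
conv_equivariant = np.allclose(
    circular_conv2d(shifted, kernel),
    np.roll(circular_conv2d(img, kernel), (2, 2), axis=(0, 1)),
)

# Patch tokens with position embeddings: the shifted image does not map to shifted tokens.
tokens_equivariant = np.allclose(
    patch_tokens(shifted, 2, pos),
    np.roll(patch_tokens(img, 2, pos), (1, 1), axis=(0, 1)),
)

print("conv equivariant:", conv_equivariant)
print("patch tokens equivariant:", tokens_equivariant)
```

The convolution gets the property for free from weight sharing, while the token pipeline would have to learn any such regularity from data.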