LoRA (low-rank adaptation) rests on the observation that the weight update learned during fine-tuning has very low intrinsic rank. The core of the technique is one equation: $y = Wx + BAx$, where $W$ is the frozen pretrained weight and $A$, $B$ are small trainable matrices.
The Affordability Problem
A 70B-parameter model has a weight matrix $W \in \mathbb{R}^{d_\text{out} \times d_\text{in}}$ at every linear layer. Full fine-tuning updates every entry, so each parameter also needs a gradient and AdamW optimizer state (two moment estimates) held in GPU memory. For a 70B model, that requires hardware unavailable to most practitioners. Hu et al. (2021) hypothesised, and supported empirically, that the change in weights produced by fine-tuning, $\Delta W$, has very low intrinsic rank, even when $W$ itself is not low-rank.
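To make the memory cost concrete, here is a back-of-the-envelope estimate. The per-parameter byte counts are assumptions for a common mixed-precision AdamW setup (16-bit weights and gradients, 32-bit master weights, two 32-bit moment estimates), not figures from the cited paper. Activations and framework overhead are ignored, and `training_state_gb` is an illustrative name.

```python
# Rough memory needed for model state during full fine-tuning with AdamW.
# Assumed (not from the text): mixed-precision training with the byte counts below.
BYTES_PER_PARAM = {
    "weights_16bit": 2,
    "gradients_16bit": 2,
    "master_weights_32bit": 4,
    "adam_first_moment_32bit": 4,
    "adam_second_moment_32bit": 4,
}

def training_state_gb(num_params: int) -> float:
    """Model-state memory in GB (10^9 bytes), excluding activations."""
    return num_params * sum(BYTES_PER_PARAM.values()) / 1e9

full_ft_gb = training_state_gb(70_000_000_000)
print(f"{full_ft_gb:.0f} GB")  # 1120 GB under these assumptions
```

Under these assumptions the model state alone exceeds a terabyte, which has to be spread across many accelerators.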
The Decomposition
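The forward pass becomes $y = Wx + BAx$, with $B \in \mathbb{R}^{d_\text{out} \times r}$ and $A \in \mathbb{R}^{r \times d_\text{in}}$. $W$ is frozen; only $A$ and $B$ are trained. That is $r(d_\text{in} + d_\text{out})$ trainable parameters instead of $d_\text{in} d_\text{out}$. For $d_\text{in} = d_\text{out} = 4096$ and $r = 8$, this is $65{,}536$ versus $16{,}777{,}216$: a $256\times$ reduction in trainable parameters per layer.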
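The arithmetic behind the reduction factor can be checked directly. This is a minimal sketch, assuming $A$ has shape $r \times d_\text{in}$ and $B$ has shape $d_\text{out} \times r$; the function name is illustrative.

```python
def lora_param_counts(d_in: int, d_out: int, r: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one linear layer."""
    full = d_in * d_out          # every entry of W
    lora = r * (d_in + d_out)    # A is r x d_in, B is d_out x r
    return full, lora

full, lora = lora_param_counts(d_in=4096, d_out=4096, r=8)
print(full, lora, full // lora)  # 16777216 65536 256
```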
Why the Intrinsic Rank Is Low
Two intuitions support this:
- Geometric. The pretrained weight $W$ already encodes general structure, so fine-tuning only needs to modify a small region of input-output space.
- Empirical. Aghajanyan et al. (2020) fine-tuned models inside randomly chosen low-dimensional subspaces of parameter space and asked how many dimensions are needed to reach 90% of full fine-tuning performance. The answer was often on the order of hundreds to thousands of dimensions, and it tended to shrink, not grow, as models got larger.
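One way to inspect the rank of a weight delta is through its singular values. The sketch below uses NumPy and a synthetic delta constructed to have rank 4, so it illustrates the measurement only and is not evidence for the empirical claim. `W_base`, `W_tuned`, and `effective_rank` are hypothetical names, and the 99% energy threshold is an arbitrary choice.

```python
import numpy as np

def effective_rank(delta: np.ndarray, energy: float = 0.99) -> int:
    """Smallest k whose top-k singular values hold `energy` of the squared spectrum."""
    s = np.linalg.svd(delta, compute_uv=False)       # singular values, descending
    cumulative = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(cumulative, energy) + 1)

rng = np.random.default_rng(0)
d, true_rank = 64, 4
W_base = rng.normal(size=(d, d))                     # full-rank stand-in for W
# Synthetic "fine-tuned" weights: base plus a rank-4 update (illustration only).
W_tuned = W_base + rng.normal(size=(d, true_rank)) @ rng.normal(size=(true_rank, d))

delta = W_tuned - W_base
print("delta:", effective_rank(delta), "of", d)
print("W_base:", effective_rank(W_base), "of", d)
```

The base matrix needs many singular directions to reach the threshold while the delta needs at most four, mirroring the point that the update can be low-rank even when the weights are not.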
Initialisation Matters
- $A$ is initialised with small Gaussian noise.
- $B$ is initialised to zero.
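The product $BA$ at initialisation is exactly zero, so the LoRA-augmented model is identical to the pretrained model at the start of fine-tuning. The asymmetry is deliberate: if both $A$ and $B$ were zero, the gradients of both would also be zero and training could not start.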
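The initialisation scheme can be sketched in a few lines of NumPy. The standard deviation of `0.01`, the toy dimensions, and the helper name `init_lora` are assumptions made for illustration.

```python
import numpy as np

def init_lora(d_in, d_out, r, rng, std=0.01):
    """A: small Gaussian noise (r x d_in); B: zeros (d_out x r)."""
    A = rng.normal(scale=std, size=(r, d_in))   # std is an assumed value
    B = np.zeros((d_out, r))
    return A, B

rng = np.random.default_rng(0)
d_in, d_out, r = 16, 12, 4
W = rng.normal(size=(d_out, d_in))    # stand-in for a frozen pretrained weight
A, B = init_lora(d_in, d_out, r, rng)

x = rng.normal(size=d_in)
y_base = W @ x
y_lora = W @ x + B @ (A @ x)          # y = Wx + BAx

print(np.array_equal(y_base, y_lora))  # True: BA is exactly zero at init
```

Computing `B @ (A @ x)` instead of `(B @ A) @ x` avoids materialising the full $d_\text{out} \times d_\text{in}$ product.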
The Alpha/r Scaling Trick
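The low-rank update is scaled by $\alpha / r$, giving $y = Wx + \frac{\alpha}{r} BAx$. $\alpha$ is a hyperparameter, commonly set to $2r$. Dividing by $r$ keeps the effective update magnitude roughly constant when the rank changes, which reduces the need to retune the learning rate across rank settings.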
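A scaled forward pass might look like the following NumPy sketch. `lora_forward` is a hypothetical helper, and `B` is filled with random values here (not zeros) only so that the update term is visible.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha):
    """y = Wx + (alpha / r) * B A x, with r read off A's first dimension."""
    r = A.shape[0]
    return W @ x + (alpha / r) * (B @ (A @ x))

rng = np.random.default_rng(0)
d_in, d_out = 16, 12
W = rng.normal(size=(d_out, d_in))    # stand-in for a frozen pretrained weight
x = rng.normal(size=d_in)

for r in (4, 8, 16):
    A = rng.normal(size=(r, d_in))
    B = rng.normal(size=(d_out, r))   # non-zero so the update is visible
    alpha = 2 * r                     # the alpha = 2r convention
    y = lora_forward(x, W, A, B, alpha)
    # With alpha = 2r the multiplier alpha / r is 2 at every rank.
    assert np.allclose(y - W @ x, 2.0 * (B @ (A @ x)))
```

Tying $\alpha$ to $r$ this way fixes the multiplier, whereas holding $\alpha$ constant while varying $r$ lets the $1/r$ factor do the rescaling.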
Summary
- Fine-tuning's weight delta has low intrinsic rank.
- LoRA factorises that delta as $BA$ with rank $r \ll \min(d_\text{in}, d_\text{out})$.
- Asymmetric initialisation ensures the model starts identical to the base.
- Trainable parameters per adapted layer drop by roughly 100-1000$\times$ at typical ranks.