Deconstructing LoRA

Part 1: The Low-Rank Decomposition

Overview

LoRA rests on the observation that the weight update learned during fine-tuning tends to have very low intrinsic rank. The core of the technique is one equation: $y = Wx + (BA)x$, where $A$ and $B$ are small trainable matrices and $W$ is the frozen pretrained weight.

The Affordability Problem

A 70B-parameter model has a weight matrix $W \in \mathbb{R}^{d_\text{out} \times d_\text{in}}$ at every linear layer. Full fine-tuning means updating every entry, which requires holding a gradient and AdamW optimizer state for each parameter in GPU memory on top of the weights themselves. For a 70B model, that demands hardware unavailable to most practitioners. Hu et al. (2021) observed that the change $\Delta W$ produced by fine-tuning has very low intrinsic rank, even when $W$ itself is not low-rank.
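
As a rough back-of-the-envelope check on the memory claim, the sketch below counts bytes per parameter for full fine-tuning. The byte counts are assumptions for illustration (one common mixed-precision AdamW accounting), not figures from the source, and activation memory is ignored entirely.

```python
# Rough memory estimate for full fine-tuning with AdamW.
# ASSUMPTION: the bytes-per-parameter figures below are one common
# mixed-precision accounting, chosen for illustration; real setups vary.
BYTES_PER_PARAM = {
    "weights_bf16": 2,
    "gradients_bf16": 2,
    "fp32_master_weights": 4,
    "adamw_first_moment_fp32": 4,
    "adamw_second_moment_fp32": 4,
}


def full_finetune_bytes(n_params: int) -> int:
    """Bytes for weights, gradients, and optimizer state (activations excluded)."""
    return n_params * sum(BYTES_PER_PARAM.values())


n_params = 70_000_000_000
total_gb = full_finetune_bytes(n_params) / 1e9
print(f"{total_gb:.0f} GB before activations")
```

Under these assumptions the total is on the order of a terabyte, which is why the cost is dominated by gradients and optimizer state rather than by the weights alone.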

The Decomposition

$$ \Delta W \approx B A, \qquad A \in \mathbb{R}^{r \times d_\text{in}}, \qquad B \in \mathbb{R}^{d_\text{out} \times r}. $$

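The forward pass becomes $y = Wx + BAx$. $W$ is frozen; only $A$ and $B$ are trained. For $d_\text{in} = d_\text{out} = 4096$ and $r = 8$, the full matrix has $4096^2 = 16{,}777{,}216$ entries, while $A$ and $B$ together have $2 \times 8 \times 4096 = 65{,}536$. That is a $256\times$ reduction in trainable parameters per layer.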
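
The shapes and the parameter count can be checked directly. The following is a minimal NumPy sketch, not a training implementation: the random values are placeholders that only exercise the shapes, and the variable names are chosen here for illustration.

```python
import numpy as np

d_in, d_out, r = 4096, 4096, 8
rng = np.random.default_rng(0)

# Frozen pretrained weight; shape (d_out, d_in) so that W @ x is defined.
W = rng.standard_normal((d_out, d_in)).astype(np.float32)
# Trainable low-rank factors. Random values here only exercise the shapes.
A = rng.standard_normal((r, d_in)).astype(np.float32)
B = rng.standard_normal((d_out, r)).astype(np.float32)

x = rng.standard_normal(d_in).astype(np.float32)

# y = Wx + B(Ax): applying A first avoids forming the d_out x d_in product BA.
y = W @ x + B @ (A @ x)

full_params = W.size            # entries updated by full fine-tuning
lora_params = A.size + B.size   # entries updated by LoRA
print(y.shape, full_params, lora_params, full_params // lora_params)
```

Computing $B(Ax)$ rather than $(BA)x$ gives the same result up to floating-point rounding, but costs two thin matrix-vector products instead of materialising a full-size matrix.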

Why the Intrinsic Rank Is Low

Two intuitions support this. Geometric: the pretrained weight $W$ already encodes general structure, so fine-tuning needs to modify only a small region of input-output space. Empirical: Aghajanyan et al. (2020) restricted fine-tuning to random low-dimensional subspaces and asked how many dimensions are needed to recover roughly 90% of full fine-tuning performance. The answer was as small as a few hundred for some model and task pairs, and it tended to shrink rather than grow as models got larger.
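
The random-subspace experiment can be written compactly. As a sketch of the usual formulation (the symbols $\theta_0$, $P$, $z$, $D$, and $d$ are introduced here for illustration and are not used elsewhere in this part), the full parameter vector is constrained to an affine subspace through the pretrained weights:

$$ \theta = \theta_0 + P z, \qquad \theta, \theta_0 \in \mathbb{R}^{D}, \qquad P \in \mathbb{R}^{D \times d}, \qquad z \in \mathbb{R}^{d}, \qquad d \ll D. $$

Here $P$ is a fixed random projection and only $z$ is trained. The intrinsic dimension is then the smallest $d$ at which training $z$ alone reaches the target performance. LoRA makes a related but more structured choice: rather than one random subspace over all parameters, each layer's update is confined to matrices of the form $BA$ with rank at most $r$.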

Initialisation Matters

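Hu et al. (2021) initialise $B$ to zero and $A$ from a random Gaussian, so the product $BA$ at initialisation is exactly zero. The LoRA-augmented model is therefore identical to the pretrained model at the start of fine-tuning. Only one of the two factors can start at zero: if both did, the gradient with respect to each would vanish and training could not get started.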
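
A quick numerical check of this property, as a minimal sketch. It assumes the initialisation used in Hu et al. (2021), with $B$ set to zero and $A$ drawn from a Gaussian; the 0.01 scale and the small dimensions are arbitrary choices made here for illustration.

```python
import numpy as np

d_in, d_out, r = 64, 48, 4
rng = np.random.default_rng(0)

W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # Gaussian init (0.01 is an assumed scale)
B = np.zeros((d_out, r))                    # zero init, so BA is the zero matrix

x = rng.standard_normal(d_in)
y_base = W @ x                    # pretrained model output
y_lora = W @ x + B @ (A @ x)      # LoRA-augmented output at initialisation

print(np.array_equal(y_base, y_lora))
```

Because $B$ is exactly zero, the low-rank branch contributes nothing and the two outputs match exactly, not just approximately.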

The Alpha/r Scaling Trick

$$ y = Wx + \frac{\alpha}{r} \cdot BA\,x. $$

$\alpha$ is a constant hyperparameter, often set to $2r$; Hu et al. (2021) set it once and do not tune it further. Dividing by $r$ keeps the effective update magnitude roughly stable when the rank changes, which reduces the need to retune the learning rate across rank settings.
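
The scaled forward pass can be sketched in a few lines of NumPy. The function name `lora_forward` and the toy dimensions are hypothetical, chosen here for illustration; the loop holds $\alpha$ fixed and varies the rank to show how the multiplier $\alpha / r$ changes.

```python
import numpy as np


def lora_forward(x, W, A, B, alpha):
    """Compute y = Wx + (alpha / r) * B(Ax), reading r from A's first dimension."""
    r = A.shape[0]
    return W @ x + (alpha / r) * (B @ (A @ x))


rng = np.random.default_rng(0)
d_in, d_out, alpha = 32, 16, 16.0   # alpha held fixed while the rank varies
W = rng.standard_normal((d_out, d_in))
x = rng.standard_normal(d_in)

for r in (4, 8, 16):
    # Random factors stand in for trained ones; only the scaling is of interest.
    A = rng.standard_normal((r, d_in))
    B = rng.standard_normal((d_out, r))
    y = lora_forward(x, W, A, B, alpha)
    print(f"r={r:2d}  alpha/r={alpha / r:.2f}  y.shape={y.shape}")
```

With $\alpha$ fixed, doubling the rank halves the multiplier, which offsets the larger number of rank-one terms summed in $BA$.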

Summary