Deconstructing LoRA: Part 3 - Rank-2 Beats Full Fine-Tune

The Claim Being Tested

Part 1's theoretical claim was strong: fine-tuning weight updates have low intrinsic rank, so a rank-$r$ approximation $BA$ with $r$ much smaller than the model dimension suffices for most adaptation tasks. If true, LoRA at low rank should reach the same downstream performance as full fine-tuning while training a small fraction of the parameters.

This part constructs a controlled experiment to test that claim. We pretrain a small MLP on a base task, define a related-but-different target task as a $45^\circ$ rotation of the same data, and compare five ways of learning the target: training from scratch, full fine-tuning of the pretrained model, and LoRA at three different ranks. The trainable-parameter counts vary by a factor of about $22$ across these methods.

Experimental Design

The experiment is deliberately small enough to run on a laptop in seconds, but rich enough to expose LoRA's properties cleanly.

Architecture. A $4$-layer MLP with hidden dimension $128$: four Linear layers ($2 \to 128 \to 128 \to 128 \to 2$) with tanh activations in between. Total parameter count: $33{,}666$. This is intentionally small: large enough that fine-tuning has parameters to work with, small enough that the experiment runs quickly.
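The parameter total can be checked by hand. The sketch below assumes each Linear layer has a weight matrix plus a bias vector, which is the usual default; `count_mlp_params` is a helper name introduced here for illustration, not something from the original code.

```python
def count_mlp_params(layer_dims):
    """Count weights and biases of a fully connected MLP."""
    total = 0
    for fan_in, fan_out in zip(layer_dims, layer_dims[1:]):
        total += fan_in * fan_out  # weight matrix
        total += fan_out           # bias vector (assumed present)
    return total

dims = [2, 128, 128, 128, 2]
n_linear_layers = len(dims) - 1        # one Linear layer per arrow
total_params = count_mlp_params(dims)
print(n_linear_layers, total_params)   # 4 33666
```

The layer sizes give four weight matrices, and the sum lands exactly on the $33{,}666$ quoted above.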

Base task. Two-moons binary classification with $\sigma = 0.15$ noise, $1{,}000$ points. Standard setup. We pretrain the MLP on this task until it reaches near-perfect accuracy.

Target task. The same data rotated by $45^\circ$. This is a domain-shifted version of the base task: the topology of the decision boundary is the same (two interleaving half-moons) but the orientation differs. The pretrained model has learned features specific to the original orientation; it has not learned features for the rotated one.
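For readers who want to reproduce the two datasets, here is a minimal standard-library sketch. It assumes the moons follow the common construction (two half-circles, the second shifted by $(1, 0.5)$, with Gaussian noise added) and that the rotation is taken about the origin; the original code may differ on both points, and the function names are illustrative.

```python
import math
import random

def make_two_moons(n_points=1000, noise=0.15, seed=0):
    """Two interleaving half-circles with Gaussian noise (assumed construction)."""
    rng = random.Random(seed)
    n_upper = n_points // 2
    n_lower = n_points - n_upper
    points, labels = [], []
    for i in range(n_upper):
        t = math.pi * i / max(n_upper - 1, 1)
        points.append((math.cos(t), math.sin(t)))
        labels.append(0)
    for i in range(n_lower):
        t = math.pi * i / max(n_lower - 1, 1)
        points.append((1.0 - math.cos(t), 0.5 - math.sin(t)))
        labels.append(1)
    noisy = [(x + rng.gauss(0.0, noise), y + rng.gauss(0.0, noise))
             for x, y in points]
    return noisy, labels

def rotate(points, degrees):
    """Rotate 2-D points counter-clockwise about the origin."""
    theta = math.radians(degrees)
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y) for x, y in points]

base_X, base_y = make_two_moons()    # base task
target_X = rotate(base_X, 45.0)      # target task: same labels, rotated inputs
target_y = base_y
```

A rotation leaves every point's distance from the origin unchanged, so the two classes keep their shape and only the orientation of the boundary moves.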

Zero-shot transfer. Evaluating the pretrained model on the rotated task gives $63.2\%$ accuracy. The model retained some useful structure (above the $50\%$ chance baseline) but is clearly not configured for the new orientation. There is a real adaptation problem to solve.

Adaptation strategies (5).

Train a fresh MLP from scratch on the rotated task (no transfer).
Full fine-tune the pretrained MLP — every parameter trainable.
LoRA fine-tune at rank $r = 2$.
LoRA fine-tune at rank $r = 4$.
LoRA fine-tune at rank $r = 8$.

All five run for $60$ epochs. From-scratch training and full fine-tuning use $\eta = 10^{-3}$; the LoRA variants use $3 \times 10^{-3}$ to compensate for the smaller effective learning signal on the restricted subspace. Full fine-tuning and all three LoRA runs start from the same pretrained weights; the from-scratch run starts from a fresh random initialisation with a deterministic seed.
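To make "starts from the pretrained base" concrete, here is a small pure-Python sketch of a LoRA-wrapped linear layer. It assumes the initialisation used in Hu et al.'s paper ($A$ random, $B$ zero) and an $\alpha / r$ scale that defaults to $1$; the class name `LoRALinear` and the toy weights are illustrative, not taken from the original code.

```python
import random

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

class LoRALinear:
    """Frozen W, b plus a trainable low-rank update B @ A (sketch)."""

    def __init__(self, W, b, rank, alpha=None, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(W), len(W[0])
        self.W, self.b = W, b  # frozen: never updated during adaptation
        # A starts small and random, B starts at zero (assumed init).
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)]
                  for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]
        self.scale = (alpha if alpha is not None else rank) / rank

    def forward(self, x):
        base = [h + bias for h, bias in zip(matvec(self.W, x), self.b)]
        delta = matvec(self.B, matvec(self.A, x))  # B (A x)
        return [h + self.scale * d for h, d in zip(base, delta)]

W = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]   # toy "pretrained" weights
b = [0.1, -0.2]
x = [1.0, 2.0, 3.0]

pretrained_out = [h + bias for h, bias in zip(matvec(W, x), b)]
layer = LoRALinear(W, b, rank=2)
print(layer.forward(x) == pretrained_out)  # True: B = 0 means no change yet
```

Because $B$ is zero at the start, the wrapped layer computes exactly the pretrained function before the first update, so the LoRA runs and full fine-tuning begin from the same point.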

Results

Method	Trainable params	% of full FT	Final acc	Time
From scratch	$33{,}666$	$100.00\%$	$98.90\%$	$0.4$ s
Full fine-tune	$33{,}666$	$100.00\%$	$99.00\%$	$0.5$ s
LoRA ($r = 2$)	$\mathbf{1{,}544}$	$\mathbf{4.59\%}$	$\mathbf{99.10\%}$	$0.5$ s
LoRA ($r = 4$)	$3{,}088$	$9.17\%$	$99.10\%$	$0.5$ s
LoRA ($r = 8$)	$6{,}176$	$18.34\%$	$98.90\%$	$0.5$ s

LoRA at rank $2$, with $1{,}544$ trainable parameters, matches full fine-tuning and is nominally $0.1$ points ahead. Those $1{,}544$ parameters are new adapter weights added alongside the model; all $33{,}666$ parameters of the pretrained base are frozen and never receive a gradient update. The intrinsic-rank hypothesis from Part 1 holds up on this task: a tiny low-rank delta captures essentially everything full fine-tuning would have done.
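The trainable-parameter column follows from simple arithmetic. The sketch below assumes every Linear layer gets an adapter pair ($A$ of shape $r \times \text{in}$, $B$ of shape $\text{out} \times r$) and that biases stay frozen; that assumption reproduces the table exactly, which suggests it matches the setup. `lora_param_count` is an illustrative helper name.

```python
def lora_param_count(layer_dims, rank):
    """Trainable LoRA parameters: rank * (fan_in + fan_out) per adapted layer."""
    return sum(rank * (fan_in + fan_out)
               for fan_in, fan_out in zip(layer_dims, layer_dims[1:]))

dims = [2, 128, 128, 128, 2]
full_params = 33_666  # total from the architecture description

counts = {r: lora_param_count(dims, r) for r in (2, 4, 8)}
for r, n in counts.items():
    print(f"r={r}: {n} trainable ({100 * n / full_params:.2f}% of full)")
# r=2: 1544 trainable (4.59% of full)
# r=4: 3088 trainable (9.17% of full)
# r=8: 6176 trainable (18.34% of full)
```

The count grows linearly in $r$, so each doubling of rank doubles the adapter size.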

Why $r = 8$ Does Not Outperform $r = 2$

Notice that doubling rank from $2$ to $4$ keeps accuracy at $99.1\%$, and doubling again to $r = 8$ actually drops accuracy to $98.9\%$. More capacity is not better. Why?

The intrinsic rank of the useful weight change for this task is small. Once the rank is large enough to capture all the useful directions of variation, adding more rank just gives the model extra dimensions to fit noise. Rank-$8$ has $4\times$ the capacity of rank-$2$; with the same training data and same number of epochs, that extra capacity can go into fitting noise rather than improving signal. The gap here is only $0.2$ points, so treat it as suggestive rather than conclusive.

This is the same lesson as in subspace methods generally: when the true effect lives in a small subspace, increasing the subspace dimension past it adds variance without adding signal. Rank is a hyperparameter you tune from below. Start at $r = 2$ or $r = 4$, increase if quality saturates below your target, and stop. Rank $> 16$ is rarely useful for any fine-tuning task short of major distribution shift (cross-language adaptation, modality adaptation, etc.).
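The "tune from below" procedure can be written as a short loop. This is a sketch under stated assumptions: `evaluate` stands in for whatever train-and-validate routine you use (hypothetical, supplied by the caller), and the demo simply looks up the accuracies from the results table instead of training anything.

```python
def pick_rank(evaluate, target_acc, candidate_ranks=(2, 4, 8, 16)):
    """Try ranks in increasing order; stop at the first that meets the target.

    `evaluate(rank)` is a caller-supplied function returning validation accuracy.
    If no rank reaches the target, the best rank seen is returned.
    """
    best_rank, best_acc = None, float("-inf")
    for rank in candidate_ranks:
        acc = evaluate(rank)
        if acc > best_acc:          # strict: ties keep the smaller rank
            best_rank, best_acc = rank, acc
        if acc >= target_acc:
            break                   # good enough, do not pay for more rank
    return best_rank, best_acc

# Demo using the accuracies from the results table as a stand-in evaluator.
table_acc = {2: 99.10, 4: 99.10, 8: 98.90}
chosen = pick_rank(table_acc.__getitem__, target_acc=99.0,
                   candidate_ranks=(2, 4, 8))
print(chosen)  # (2, 99.1)
```

Breaking ties toward the smaller rank keeps the adapter as small as the target allows.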

Hu et al.'s original LoRA paper reported a similar pattern on GLUE (with RoBERTa and DeBERTa), on generation benchmarks with GPT-2, and on GPT-3. Their rank ablation on GPT-3 found that ranks in the low single digits were already competitive, and that raising the rank to $64$ brought no consistent gain.

Why From-Scratch Almost Matches Fine-Tuning

An aside worth noticing: from-scratch training reaches $98.9\%$ — within $0.1$ points of full fine-tuning. On this toy task, the pretraining did not provide much advantage. The reason is that the task is small enough ($1{,}000$ points, simple decision boundary) that even random initialisation reaches near-perfect accuracy in $60$ epochs.

This is misleading at scale. On real tasks (training a $7$B-parameter language model from scratch vs fine-tuning a pretrained one on a specialised corpus), from-scratch is dramatically more expensive and produces worse results. The from-scratch baseline in our experiment is a sanity check that the rotated task is indeed learnable; it doesn't tell you much about real-world fine-tuning economics.

Practical Implications

The benchmark numbers are toy. The practical implications of LoRA's low-rank-suffices property are not toy at all.

Memory. Only LoRA parameters require optimizer state. AdamW stores two moment buffers ($m$ and $v$) per trainable parameter. For a $7$B-parameter model with rank-$8$ LoRA on attention layers only (a typical configuration), the trainable parameter count is roughly $8$M. AdamW state shrinks from $\sim 56$ GB (two fp32 buffers $\times$ 7B params) to roughly $64$ MB, a reduction of close to a thousand-fold in optimizer state. This is the difference between needing an A100/H100 and fitting on a $24$ GB consumer GPU.
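The memory figures can be reproduced with a few lines of arithmetic. The sketch assumes a Llama-style 7B layout ($32$ layers, model width $4096$, square query/key/value/output projections, all four adapted) and fp32 AdamW moments; these numbers are assumptions chosen to match the "roughly $8$M" figure, not details given in the text.

```python
def lora_attention_params(n_layers, d_model, rank, n_projections=4):
    """LoRA parameters when each square attention projection gets an adapter."""
    per_projection = rank * (d_model + d_model)  # A is r x d, B is d x r
    return n_layers * n_projections * per_projection

def adamw_state_bytes(n_trainable, bytes_per_value=4):
    """AdamW keeps two moment buffers (m and v) per trainable parameter."""
    return 2 * n_trainable * bytes_per_value

full_state = adamw_state_bytes(7_000_000_000)
lora_params = lora_attention_params(n_layers=32, d_model=4096, rank=8)
lora_state = adamw_state_bytes(lora_params)

print(f"full fine-tune AdamW state: {full_state / 1e9:.0f} GB")
print(f"LoRA trainable params:      {lora_params:,}")
print(f"LoRA AdamW state:           {lora_state / 1e6:.0f} MB")
print(f"reduction:                  {full_state / lora_state:.0f}x")
```

Under these assumptions the adapter has about $8.4$M parameters and its optimizer state is in the tens of megabytes, a reduction of several hundred times and the same order of magnitude as the figures above.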

Serving. Ship one frozen base model plus many small LoRA adapters (one per fine-tuned task or user or domain). Each adapter is megabytes; the base is the gigabyte-scale piece. Switching tasks at serving time is a parameter-load (load a different adapter), not a model-reload. This is what makes multi-tenant fine-tuned serving (Replicate, Together AI, Anyscale) economically viable. Without LoRA, every fine-tuned model would need its own full copy of the weights in GPU memory.
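The serving pattern can be sketched in a few lines: one shared base matrix, a dictionary of adapters, and a per-request choice of which adapter to apply. `AdapterServer` and the adapter names are hypothetical, and the toy integer matrices stand in for real weights.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

class AdapterServer:
    """One frozen base weight shared by many small (B, A) adapters (sketch)."""

    def __init__(self, base_W):
        self.base_W = base_W   # loaded once, never modified
        self.adapters = {}     # name -> (B, A)

    def register(self, name, B, A):
        self.adapters[name] = (B, A)

    def forward(self, name, x):
        B, A = self.adapters[name]          # cheap lookup, no model reload
        base_out = matvec(self.base_W, x)
        delta = matvec(B, matvec(A, x))     # low-rank correction B (A x)
        return [h + d for h, d in zip(base_out, delta)]

server = AdapterServer(base_W=[[1, 0], [0, 1]])
server.register("task_a", B=[[1], [0]], A=[[1, 1]])
server.register("task_b", B=[[0], [2]], A=[[1, -1]])

x = [3, 1]
out_a = server.forward("task_a", x)
out_b = server.forward("task_b", x)
print(out_a, out_b)  # [7, 1] [3, 5]
```

Switching tenants only changes which small pair of matrices is read; the base stays resident and untouched.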

Composition. A LoRA adapter is an additive update to the frozen weights, so two LoRAs trained on different tasks can be summed or interpolated, and the combined adapter often behaves usefully. The image-generation ecosystem (Civitai and the wider Stable Diffusion LoRA marketplace) is built on this property: users stack a "character LoRA" with a "style LoRA" and a "pose LoRA" to compose a final image. None of that works if the LoRAs were attached to slightly-drifted bases.

Reversibility. Because the base is untouched, you can undo a fine-tune by removing the adapter. Full fine-tuning destroys the base; LoRA preserves it. If a fine-tune produces unwanted behaviour, you can roll back instantly.
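Composition and reversibility both come down to the update being a plain matrix sum, $W = W_0 + BA$. The sketch below uses small integer matrices so the arithmetic is exact; all names and values are illustrative.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def add(X, Y, sign=1):
    """Elementwise X + sign * Y."""
    return [[x + sign * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

W0 = [[1, 2, 0], [0, 1, 3]]          # frozen base (2 x 3)
B1, A1 = [[1], [2]], [[1, 0, 1]]     # rank-1 adapter for task 1
B2, A2 = [[0], [1]], [[2, 1, 0]]     # rank-1 adapter for task 2

delta1 = matmul(B1, A1)
delta2 = matmul(B2, A2)

# Reversibility: merging then subtracting the same delta restores the base.
merged = add(W0, delta1)
restored = add(merged, delta1, sign=-1)

# Composition: summing two deltas equals one rank-2 adapter whose factors
# are the two adapters' factors stacked side by side.
composed = add(add(W0, delta1), delta2)
B_cat = [b1 + b2 for b1, b2 in zip(B1, B2)]   # 2 x 2
A_cat = A1 + A2                               # 2 x 3
stacked = add(W0, matmul(B_cat, A_cat))

print(restored == W0, composed == stacked)  # True True
```

With floating-point weights the round trip is exact only up to rounding error, which is one reason to keep adapters unmerged when rollback matters.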

Why the Frozen Base Is Load-Bearing

The freezing is what makes all four of those practical properties work. If the base drifted even slightly during LoRA training, you would lose:

The "ship one base + many adapters" property — each adapter would be tied to a slightly different base, and you couldn't mix them at serving time without quality degradation.

The protection against catastrophic forgetting: the base still knows what it knew before training the adapter, because nothing in it changed. Drift the base and you lose the base's pre-training abilities in exchange for the new task; that is exactly the trade-off that full fine-tuning forces you to make.

The composability of adapters — adapters from different runs would operate on slightly different bases and could not be summed cleanly.

The reversibility — you could not undo the fine-tune if the base had changed.

So freezing is not merely a memory optimisation; it is what makes adapter-style fine-tuning work as a deployable production pattern. The memory savings are a happy consequence.

Generalisation Beyond This Experiment

At LLM scale, the same low-rank-suffices pattern holds with even more dramatic numbers:

Llama-3 70B with rank-$16$ LoRA on attention only: $\sim 60$M trainable parameters out of $70$B total, roughly $0.09\%$. Configurations like this are widely reported to reach quality comparable to full fine-tuning on many downstream tasks.

Hu et al.'s original LoRA paper reports LoRA at small ranks matching or approaching full fine-tuning on GLUE (with RoBERTa and DeBERTa), on generation benchmarks with GPT-2 Medium and Large, and on GPT-3 ($175$B).

QLoRA (Dettmers et al., 2023) combined 4-bit quantisation of the frozen base with LoRA fine-tuning, enabling $65$B-parameter fine-tuning on a single $48$ GB GPU. The economic implications were immediate: research labs that previously couldn't afford to fine-tune at the largest scales suddenly could.
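Two of the numbers above can be sanity-checked with back-of-envelope arithmetic. The sketch assumes the published Llama-3 70B layout ($80$ layers, width $8192$, grouped-query attention with key/value projections of width $1024$) with adapters on all four attention projections, and for QLoRA it counts only the base weights, ignoring quantisation constants, activations, and adapter state. Function names are illustrative.

```python
def lora_params_gqa(n_layers, d_model, d_kv, rank):
    """LoRA parameters on q, k, v, o when k and v are narrower (GQA)."""
    q = rank * (d_model + d_model)
    k = rank * (d_model + d_kv)
    v = rank * (d_model + d_kv)
    o = rank * (d_model + d_model)
    return n_layers * (q + k + v + o)

def weight_bytes(n_params, bits_per_param):
    """Storage for the weights alone at a given precision."""
    return n_params * bits_per_param // 8

# Llama-3 70B, rank 16, attention only (assumed configuration).
trainable = lora_params_gqa(n_layers=80, d_model=8192, d_kv=1024, rank=16)
fraction = trainable / 70e9
print(f"{trainable:,} trainable, {100 * fraction:.3f}% of 70B")

# QLoRA: a 65B base in 16-bit versus 4-bit storage.
base_16bit = weight_bytes(65_000_000_000, 16)
base_4bit = weight_bytes(65_000_000_000, 4)
print(f"16-bit: {base_16bit / 1e9:.1f} GB, 4-bit: {base_4bit / 1e9:.1f} GB")
```

Under these assumptions the adapter comes to about $65$M parameters, a little under $0.1\%$ of the model, and the 4-bit base needs about $32.5$ GB, which leaves headroom on a $48$ GB card where the 16-bit weights alone ($130$ GB) would not fit.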

The mechanism is the same as in our moons experiment, just at millions of times the scale. LoRA is one of the rare techniques in deep learning that scales better the larger the model, because larger models have more redundant capacity, and the intrinsic rank of useful adaptations stays small while the base parameter count grows. The fraction of the model that LoRA touches shrinks with scale.

When LoRA Does Not Suffice

LoRA is not magic. Specific failure modes:

Cross-language adaptation. Adapting an English-pretrained model to Japanese or Arabic requires substantially more capacity than English-to-English domain shift. Rank $64$ or higher is sometimes needed, which erodes the parameter savings, although the adapter is still a small fraction of the full model.

Modality shifts. Adapting a vision-language model to a new modality (audio, sensor data) requires large architectural changes the LoRA wrapper cannot make. The base architecture must support the new modality; LoRA only adjusts existing weights, not topology.

Major distribution shifts. If the fine-tuning task involves entirely new vocabulary, new task formats, new output structures, the low-rank assumption breaks down. Continued pretraining (full fine-tuning of all weights) is sometimes the right answer.

The empirical rule: if the target task is "similar shape" to the pretraining task — same modality, similar vocabulary, related domain — LoRA at low rank works. If the target is genuinely different in kind, full fine-tuning may be necessary.

Summary

LoRA at rank $2$ — $1{,}544$ trainable parameters, $4.59\%$ of full — matches or beats full fine-tuning on the moons-rotation adaptation task.
Higher ranks ($r = 4, 8$) do not improve quality and start to overfit. The intrinsic rank of useful updates for this task is genuinely small.
The practical implications (roughly $1000\times$ less optimizer state, one base serving many adapters, composability, reversibility) are why LoRA has become the default for consumer-grade LLM fine-tuning over the last two years.
LoRA scales better the larger the base model: the fraction of trainable parameters shrinks as base size grows.

Full code on GitHub: github.com/soveshmohapatra/LoRA