Part 1 introduced the central claim: weight updates during fine-tuning have very low intrinsic rank, so a rank-$r$ approximation $BA$ to the full delta $\Delta W$ suffices for most tasks. Now we turn that into PyTorch code. The full implementation is roughly $50$ lines: a wrapper class, a recursive injector that swaps every nn.Linear in a model with its LoRA-wrapped version, a few helper functions for parameter bookkeeping, and a $5$-line merge function for zero-overhead inference.
The interesting parts are not the data structures — those are obvious — but the subtle correctness points. Why the forward pass is two matmuls instead of one. Why the initialisation has to be asymmetric. How the recursive injector handles arbitrary model architectures. And what the merge function lets you do that the wrapper version cannot.
The LoRALinear Wrapper
The core class wraps any existing nn.Linear with two trainable low-rank matrices. The base linear's weight remains frozen; only $A$ and $B$ receive gradients.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        # Freeze the wrapped layer: its weight and bias never receive gradients.
        for p in self.base.parameters():
            p.requires_grad_(False)
        in_features = base.in_features
        out_features = base.out_features
        self.rank = rank
        self.scale = alpha / rank
        # Asymmetric init: A random, B zero, so B @ A == 0 at step zero.
        self.A = nn.Parameter(torch.randn(rank, in_features) / rank)
        self.B = nn.Parameter(torch.zeros(out_features, rank))

    def forward(self, x):
        base_out = self.base(x)
        # Factor through the rank-r intermediate; never build the dense B @ A.
        lora_out = (x @ self.A.T) @ self.B.T * self.scale
        return base_out + lora_out
Three design choices are worth slowing down on.
The base is frozen via requires_grad_(False). Setting requires_grad = False on a parameter tells PyTorch's autograd to not compute or store gradients for it. The parameter remains a valid tensor — you can read its values, use it in forward passes, save it in checkpoints — it just doesn't get updated by the optimizer. The optimizer ignores it because we filter parameters by p.requires_grad when constructing the optimizer (see usage below).
Asymmetric initialisation: $A$ random, $B$ zero. The product $BA$ at initialisation is $B \cdot A = 0 \cdot A = 0$. The LoRA-augmented model is therefore exactly identical to the frozen base model at step zero. Gradient descent then nudges $A$ and $B$ to capture whatever delta is needed for the new task. If both $A$ and $B$ were random, the initial $BA$ would be random and the model would start at a worse-than-base configuration — the optimizer would have to spend early iterations un-doing the random damage. The asymmetric initialisation removes this overhead entirely.
The forward pass is two matmuls, factored through the rank-$r$ intermediate. The line (x @ self.A.T) @ self.B.T * self.scale first projects $x$ into the rank-$r$ space (cost $O(r \cdot d_\text{in})$), then projects back to the output space (cost $O(r \cdot d_\text{out})$). Total: $O(r(d_\text{in} + d_\text{out}))$ — much smaller than the $O(d_\text{in} d_\text{out})$ that materialising the dense $BA$ matrix would require.
This factoring is essential, not cosmetic. If we wrote the forward pass as x @ (self.B @ self.A).T * self.scale instead, we would briefly construct a $d_\text{out} \times d_\text{in}$ tensor — exactly the size of the full weight matrix we are trying to avoid. The LoRA memory savings rely on never materialising that intermediate.
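The first two design choices can be checked directly on a toy layer. The sketch below repeats a compact copy of the wrapper so it runs on its own; the layer sizes, the seed, and the variable names are illustrative assumptions, not part of the original implementation.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Compact copy of the wrapper, repeated so this check is self-contained."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.scale = alpha / rank
        self.A = nn.Parameter(torch.randn(rank, base.in_features) / rank)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x):
        return self.base(x) + (x @ self.A.T) @ self.B.T * self.scale


torch.manual_seed(0)
base = nn.Linear(16, 32)        # toy sizes, chosen only for illustration
x = torch.randn(5, 16)
expected = base(x).detach()     # output of the untouched base layer

lora = LoRALinear(base, rank=4, alpha=8.0)

# B is all zeros, so the LoRA branch contributes nothing at step zero.
same_at_init = torch.equal(lora(x).detach(), expected)

# Only A and B are trainable; the base weight and bias are frozen.
trainable = sorted(n for n, p in lora.named_parameters() if p.requires_grad)

# After one backward pass B has a non-zero gradient, while A's gradient is
# exactly zero because it is multiplied by B = 0.
lora(x).sum().backward()
grad_B_nonzero = bool(lora.B.grad.abs().sum() > 0)
grad_A_zero = bool(lora.A.grad.abs().sum() == 0)
```

One consequence worth noting: at the very first step only $B$ moves, and $A$ starts receiving a useful gradient once $B$ is non-zero.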
The Alpha/r Scaling Trick
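The self.scale = alpha / rank line implements the LoRA scaling factor:
$$h = W x + \frac{\alpha}{r} B A x$$
where $W$ is the frozen base weight (bias omitted) and $h$ is the layer output.
$\alpha$ is a hyperparameter typically set to $2r$ or to a fixed value like $16$. The division by $r$ might look arbitrary; it's actually load-bearing.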
Consider what happens if you change the rank from $4$ to $8$. With twice the rank, the matrix $BA$ has roughly twice the typical magnitude (more dimensions to accumulate signal in). If you kept the same learning rate, your effective parameter updates would be roughly twice as large for rank-$8$ as for rank-$4$. This means changing the rank would force you to retune the learning rate.
The $\alpha/r$ scaling decouples this: doubling the rank halves the scale, keeping the effective magnitude of $BA \cdot x$ roughly constant. The same learning rate works across rank choices. This makes ablating rank a much cheaper experiment because you don't have to re-tune $\eta$ each time.
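To make the halving concrete, a few lines of plain Python tabulate the scale for a fixed $\alpha$. The value $16$ and the list of ranks are illustrative choices. The second dictionary shows the other common convention, $\alpha = 2r$, under which the scale is the same for every rank.

```python
# Fixed alpha: the scale halves every time the rank doubles.
alpha = 16.0
ranks = (2, 4, 8, 16)
fixed_alpha_scales = {r: alpha / r for r in ranks}

# alpha tied to the rank (alpha = 2r): the scale is constant.
tied_alpha_scales = {r: (2 * r) / r for r in ranks}

for r in ranks:
    print(f"rank={r:>2}  fixed-alpha scale={fixed_alpha_scales[r]}  "
          f"tied-alpha scale={tied_alpha_scales[r]}")
```

The decoupling argument applies to the fixed-$\alpha$ case; with $\alpha = 2r$ the multiplier no longer changes when the rank does.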
Why the Two-Matmul Order Matters
One more subtle point about the forward pass. x @ self.A.T has shape $(B, T, r)$, a small projection (here the leading $B$ is the batch size, not the LoRA matrix, and $T$ is the sequence length). (x @ self.A.T) @ self.B.T has shape $(B, T, d_\text{out})$.
If you reversed the order, first computing self.B @ self.A and then multiplying by $x$, you would hold a $d_\text{out} \times d_\text{in}$ tensor in memory. For a $4096 \times 4096$ Linear with rank $8$, that intermediate is about $16$M entries ($4096^2$), whereas the factored form only holds $8$ values per token (for example, about $8$K entries for a $1024$-token sequence). On a $70$B-parameter model with LoRA applied to every attention projection, those dense products add up to roughly $50$ GB of avoidable memory pressure if they are all kept alive for the backward pass.
PyTorch will not warn you about this. It is a correctness-equivalent transformation that destroys the memory advantage. Always factor through the rank-$r$ intermediate.
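A small numerical check of both claims, that the two orderings give the same output and that only one of them builds a large intermediate. The tensor sizes, the scale value, and the variable names are illustrative assumptions.

```python
import torch

torch.manual_seed(0)
d_in, d_out, rank = 64, 48, 4     # toy sizes; real layers are far larger
batch, seq = 2, 10
scale = 2.0

x = torch.randn(batch, seq, d_in)
A = torch.randn(rank, d_in)
B = torch.randn(d_out, rank)

# Factored: project down to rank r, then up to d_out.
low = x @ A.T                     # shape (batch, seq, rank)
factored = low @ B.T * scale      # shape (batch, seq, d_out)

# Dense: build the full d_out x d_in delta first, then apply it.
delta = B @ A                     # shape (d_out, d_in)
dense = x @ delta.T * scale

outputs_match = torch.allclose(factored, dense, rtol=1e-4, atol=1e-4)

# Entries held by each formulation's intermediate tensor.
intermediate_sizes = (low.numel(), delta.numel())
```

The factored intermediate grows with the number of tokens, the dense one with $d_\text{in} \cdot d_\text{out}$; at realistic layer widths the second is the one that hurts.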
The Recursive Injector
To apply LoRA to an existing model, we need to traverse the model tree and replace every nn.Linear with its LoRA-wrapped version. The injector does this recursively:
def inject_lora(module, target_class=nn.Linear, rank=4, alpha=8.0):
    injected = []
    for name, child in list(module.named_children()):
        if isinstance(child, target_class):
            lora = LoRALinear(child, rank=rank, alpha=alpha)
            setattr(module, name, lora)
            injected.append(lora)
        else:
            injected.extend(inject_lora(child, target_class, rank, alpha))
    return injected
Three things to notice.
The list(module.named_children()). We materialise the children iterator into a list before iterating, because setattr on the module writes to the underlying _modules dict while we are walking it. Overwriting an existing key does not change the dict's size, so CPython would not raise an error here, but mutating a container during iteration is fragile and easy to break in a later refactor. The list copy is a cheap defensive measure.
setattr(module, name, lora) as the replacement mechanism. PyTorch's module system uses Python's attribute lookup: module.my_linear looks up module._modules['my_linear'] automatically. Setting module.my_linear = lora replaces the child module in place, and subsequent forward passes pick up the new wrapped version. No special PyTorch API needed.
The target_class argument lets you restrict LoRA by module type. In real Transformer fine-tuning, you often want LoRA only on attention projections, not on MLP layers (which would consume more of the parameter budget). A type filter cannot express that, because attention and MLP projections are both nn.Linear; it takes name-based filtering, which is what the Hugging Face PEFT library provides (its target_modules setting accepts module names or a regex). Our version is simpler: any nn.Linear gets wrapped.
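Filtering by name rather than by type could look like the following sketch. Everything specific here is hypothetical: inject_lora_by_name, the ToyBlock module, and the keyword strings are made up for illustration and are not how PEFT implements its filtering.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Compact copy of the wrapper, repeated so this sketch is self-contained."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.scale = alpha / rank
        self.A = nn.Parameter(torch.randn(rank, base.in_features) / rank)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x):
        return self.base(x) + (x @ self.A.T) @ self.B.T * self.scale


def inject_lora_by_name(module, keywords, rank=4, alpha=8.0, prefix=""):
    """Hypothetical variant: wrap an nn.Linear only if its dotted path
    contains one of `keywords`. Returns the paths that were wrapped."""
    injected = []
    for name, child in list(module.named_children()):
        path = f"{prefix}.{name}" if prefix else name
        if isinstance(child, LoRALinear):
            continue  # already wrapped; do not wrap its inner base again
        if isinstance(child, nn.Linear):
            if any(k in path for k in keywords):
                setattr(module, name, LoRALinear(child, rank=rank, alpha=alpha))
                injected.append(path)
        else:
            injected.extend(
                inject_lora_by_name(child, keywords, rank, alpha, path)
            )
    return injected


class ToyBlock(nn.Module):
    """Made-up block with attention-style and MLP-style linears.
    No forward() is defined because this is only a structural check."""

    def __init__(self, d: int = 8):
        super().__init__()
        self.q_proj = nn.Linear(d, d)
        self.v_proj = nn.Linear(d, d)
        self.mlp_up = nn.Linear(d, 4 * d)
        self.mlp_down = nn.Linear(4 * d, d)


model = nn.Sequential(ToyBlock(), ToyBlock())
wrapped = inject_lora_by_name(model, keywords=("q_proj", "v_proj"))
```

Substring matching is the crudest possible rule; a regex or an explicit allow-list of paths would slot into the same place.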
Trainable-Parameter Bookkeeping
Three small helpers handle the parameter inventory:
def count_trainable(module):
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


def count_total(module):
    return sum(p.numel() for p in module.parameters())


def freeze_all(module):
    for p in module.parameters():
        p.requires_grad_(False)
The pattern that ties everything together looks like this:
base_model = load_pretrained()               # all params trainable by default
freeze_all(base_model)                       # nothing trainable
inject_lora(base_model, rank=4, alpha=8)     # A, B in each layer trainable
opt = torch.optim.AdamW(
    p for p in base_model.parameters() if p.requires_grad
)
train(base_model, target_data)
The optimizer construction is the key line. By filtering parameters() with p.requires_grad, the optimizer only manages the LoRA parameters $A$ and $B$, not the frozen base. Its state (AdamW's $m$ and $v$ buffers) therefore covers "tens of millions" of entries rather than "billions."
If you forget the filter (say, just torch.optim.AdamW(base_model.parameters())), the outcome depends on the optimizer. PyTorch's built-in AdamW creates its moment buffers lazily, on the first step(), and skips any parameter whose .grad is None, so frozen parameters should not get buffers there. Other optimizers and wrappers (sharded or third-party implementations, for instance) may allocate state eagerly for everything they are handed, and that would eat your memory savings. Keeping the filter makes the behaviour explicit and independent of those details; treat it as non-optional.
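Run end to end on a toy model, the bookkeeping looks like this. The two-layer nn.Sequential stands in for load_pretrained(), and the random batch and squared-output loss stand in for train(); both are placeholders chosen for illustration.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Compact copy of the wrapper, repeated so this sketch is self-contained."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.scale = alpha / rank
        self.A = nn.Parameter(torch.randn(rank, base.in_features) / rank)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x):
        return self.base(x) + (x @ self.A.T) @ self.B.T * self.scale


def inject_lora(module, target_class=nn.Linear, rank=4, alpha=8.0):
    injected = []
    for name, child in list(module.named_children()):
        if isinstance(child, target_class):
            lora = LoRALinear(child, rank=rank, alpha=alpha)
            setattr(module, name, lora)
            injected.append(lora)
        else:
            injected.extend(inject_lora(child, target_class, rank, alpha))
    return injected


def count_trainable(module):
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


def count_total(module):
    return sum(p.numel() for p in module.parameters())


def freeze_all(module):
    for p in module.parameters():
        p.requires_grad_(False)


torch.manual_seed(0)
# Stand-in for a pretrained model (toy sizes).
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
total_before = count_total(model)          # (32*64 + 64) + (64*10 + 10) = 2762

freeze_all(model)
trainable_frozen = count_trainable(model)  # 0: nothing is trainable yet

inject_lora(model, rank=4, alpha=8.0)
trainable_lora = count_trainable(model)    # (4*32 + 64*4) + (4*64 + 10*4) = 680

opt = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3
)
managed = sum(p.numel() for g in opt.param_groups for p in g["params"])

# One optimisation step on random data; the frozen base weight must not move.
base_weight_before = model[0].base.weight.clone()
loss = model(torch.randn(8, 32)).pow(2).mean()
loss.backward()
opt.step()
base_unchanged = torch.equal(model[0].base.weight, base_weight_before)
```

On this toy model the adapters are about a quarter of the base size, far more than in a real network, because the layers are tiny; the ratio shrinks as $d_\text{in}$ and $d_\text{out}$ grow relative to $r$.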
Merging at Inference Time
At inference, the LoRA path adds two extra small matmuls (through $A$, then $B$) to the forward pass of every wrapped layer. For most use cases this overhead is acceptable. But there is a way to eliminate it entirely: fold $BA$ into the base weight before inference.
@torch.no_grad()
def merge_lora(lora_linear: LoRALinear) -> nn.Linear:
    base = lora_linear.base
    base.weight += lora_linear.scale * (lora_linear.B @ lora_linear.A)
    return base
Five lines. After merging, the model looks structurally identical to the pretrained base — same architecture, same parameter count, same inference latency. The LoRA adapter has been absorbed into the base weight.
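merge_lora returns the merged nn.Linear but does not put it back into the model tree, so something has to do the swap. One way to do that, and to confirm the merged model computes the same function, is sketched below. merge_all is a hypothetical helper, the toy model and the random values written into $B$ (standing in for training) are assumptions, and merge_lora is repeated here wrapped in torch.no_grad() so the in-place update stays out of the autograd graph.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Compact copy of the wrapper, repeated so this sketch is self-contained."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.scale = alpha / rank
        self.A = nn.Parameter(torch.randn(rank, base.in_features) / rank)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x):
        return self.base(x) + (x @ self.A.T) @ self.B.T * self.scale


@torch.no_grad()
def merge_lora(lora_linear: LoRALinear) -> nn.Linear:
    base = lora_linear.base
    # B @ A has shape (out_features, in_features), matching base.weight.
    base.weight += lora_linear.scale * (lora_linear.B @ lora_linear.A)
    return base


def merge_all(module: nn.Module) -> nn.Module:
    """Hypothetical helper: swap every LoRALinear for its merged nn.Linear."""
    for name, child in list(module.named_children()):
        if isinstance(child, LoRALinear):
            setattr(module, name, merge_lora(child))
        else:
            merge_all(child)
    return module


torch.manual_seed(0)
model = nn.Sequential(
    LoRALinear(nn.Linear(16, 32)),
    nn.ReLU(),
    LoRALinear(nn.Linear(32, 4)),
)

# Stand-in for training: give every B non-zero values so the delta matters.
with torch.no_grad():
    for m in model.modules():
        if isinstance(m, LoRALinear):
            m.B.normal_(std=0.1)

x = torch.randn(5, 16)
before = model(x).detach()      # output with the LoRA branches active

merge_all(model)
after = model(x).detach()       # output from plain nn.Linear layers

outputs_match = torch.allclose(before, after, rtol=1e-4, atol=1e-5)
no_wrappers_left = not any(isinstance(m, LoRALinear) for m in model.modules())
```

The match is approximate rather than bit-exact, because the merged and unmerged paths add the same terms in a different order.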
This is the design property that makes LoRA fundamentally different from earlier "adapter" methods (Houlsby adapters, prefix tuning). Those approaches add components that persist at inference time: bottleneck layers that add latency proportional to the number of adapter modules, or learned prefix vectors that take up part of the sequence. LoRA's $BA$ delta can be folded into the existing weight, so inference cost is exactly the same as the base model.
One caveat about merging. merge_lora modifies the base weight in place and drops the wrapper, so the original weight is gone unless you kept a copy, or kept $A$ and $B$ so that you can subtract the same delta again (which recovers it only up to floating-point rounding). To support task switching at serving time (one base model, many adapters, switch between them), you keep the LoRA modules unmerged and toggle which adapter's $A$ and $B$ get used at each forward call. This is what Hugging Face PEFT's set_adapter mechanism does.
Why Freezing Matters Beyond Memory
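The most obvious benefit of freezing is memory: AdamW's state for $70$B parameters is $\sim 280$ GB; for the $\sim 8$M LoRA parameters it's $\sim 32$ MB (two moment buffers at $2$ bytes per entry; double both figures for fp32 state). That alone removes optimizer state as the bottleneck for $70$B fine-tuning. What remains is the frozen weights themselves plus activations, so a $70$B model still does not fit on a single $24$ GB consumer GPU without further measures such as quantising the base.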
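The arithmetic behind those optimizer-state figures is short enough to write out. The sketch assumes two moment buffers per parameter stored at $2$ bytes per entry, which is the assumption that reproduces the numbers quoted above; adamw_state_bytes is a hypothetical helper name.

```python
def adamw_state_bytes(num_params: int, bytes_per_entry: int = 2) -> int:
    """AdamW keeps two moment buffers (m and v) per managed parameter."""
    return 2 * num_params * bytes_per_entry


full_params = 70_000_000_000    # every weight of a 70B model
lora_params = 8_000_000         # the LoRA parameter count quoted in the text

full_gb = adamw_state_bytes(full_params) / 1e9       # 280.0 GB
lora_mb = adamw_state_bytes(lora_params) / 1e6       # 32.0 MB
ratio = full_params // lora_params                   # 8750x fewer entries

# With fp32 optimizer state (4 bytes per entry) both figures double.
full_gb_fp32 = adamw_state_bytes(full_params, 4) / 1e9   # 560.0 GB
lora_mb_fp32 = adamw_state_bytes(lora_params, 4) / 1e6   # 64.0 MB
```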
But there is a second, less-obvious benefit: composability. Because the base is unchanged, two LoRAs trained on different tasks operate on the same underlying base model. You can sum them, interpolate between them, or load one at a time and switch at inference. This is what makes LoRA the dominant fine-tuning method for stackable image-generation workflows (Civitai, the entire Stable Diffusion ecosystem) — users compose dozens of LoRAs for different styles, characters, and effects, all anchored to the same base.
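Summing or interpolating adapters works because each one is just an additive delta on the same frozen weight. The sketch below checks that running two adapter branches side by side gives the same output as folding a weighted sum of their deltas into the base. The random tensors standing in for two trained adapters, the blend weights, and the scale values are all illustrative assumptions.

```python
import torch

torch.manual_seed(0)
d_in, d_out, rank = 16, 12, 4

W = torch.randn(d_out, d_in)      # shared frozen base weight

# Two hypothetical adapters, standing in for LoRAs trained on different tasks.
A1, B1 = torch.randn(rank, d_in), torch.randn(d_out, rank)
A2, B2 = torch.randn(rank, d_in), torch.randn(d_out, rank)
s1, s2 = 2.0, 2.0                 # each adapter's alpha / r
w1, w2 = 0.7, 0.3                 # blend weights chosen by the user

x = torch.randn(5, d_in)

# Unmerged: base path plus two weighted low-rank branches.
branches = (
    x @ W.T
    + w1 * s1 * (x @ A1.T) @ B1.T
    + w2 * s2 * (x @ A2.T) @ B2.T
)

# Merged: fold the weighted sum of both deltas into one weight matrix.
W_combined = W + w1 * s1 * (B1 @ A1) + w2 * s2 * (B2 @ A2)
merged = x @ W_combined.T

outputs_match = torch.allclose(branches, merged, rtol=1e-4, atol=1e-3)
```

Setting one blend weight to zero recovers the other adapter alone, which is the "load one at a time" case.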
If you allow the base to drift even slightly during LoRA training, you lose this property. Two LoRAs trained against slightly different bases cannot be summed cleanly. Freezing is not only a memory optimisation; it's what makes adapter-style fine-tuning a deployable production pattern.
What This Implementation Skips
Production LoRA stacks (Hugging Face PEFT, typically paired with bitsandbytes for quantised base weights) add several features we omit for clarity:
QLoRA quantisation. Dettmers et al. (2023) showed you can quantise the frozen base to 4-bit precision and still train LoRA adapters in higher precision. This makes the base model fit in much less memory ($70$B becomes $\sim 35$ GB at 4-bit) while LoRA training proceeds normally. Our implementation keeps the base in full precision; QLoRA-style quantisation is a separate piece of engineering.
Mixed-precision LoRA. Real implementations train LoRA in bf16/fp16 with selective fp32 accumulation for stability. We use whatever precision PyTorch defaults to.
Configurable target filtering. PEFT exposes regex-based target filtering, separate LoRA configurations per layer type, parameter-efficient fine-tuning bookkeeping, etc. Our $\sim 50$ lines handle the algorithm but not the configuration surface.
Multi-adapter management. Loading and switching between multiple LoRAs at serving time requires explicit state management. We support one adapter per model. PEFT's PeftModel supports many.
The Whole Implementation
Counting only substantive code: the wrapper is $20$ lines, the injector is $10$ lines, the helpers are $6$ lines, the merge function is $5$ lines. Total: roughly $40$ lines. With docstrings and a thin module structure, the full lora.py file is around $50$ lines.
What you can do with this: take any model containing nn.Linear layers, freeze it, inject LoRA, train. What you cannot do: handle the production-deployment surface area that PEFT covers. For learning, the simpler version is the right entry point.
What Part 3 Tests
With LoRA in hand, Part 3 runs a controlled adaptation experiment: pretrain a $4$-layer MLP on standard two-moons; define a target task as the same data rotated by $45^\circ$; and compare full fine-tuning against LoRA at ranks $2$, $4$, $8$. The result is that rank-$2$ LoRA — $1{,}544$ trainable parameters, $4.59\%$ of the full parameter count — matches or beats full fine-tuning. The intrinsic-rank hypothesis from Part 1 holds up empirically.
Full code on GitHub: github.com/soveshmohapatra/LoRA