In 2015, a curious observation challenged deep learning orthodoxy: deeper networks should perform better, but they didn't. As networks grew beyond 20 layers, accuracy saturated and then degraded rapidly. This wasn't overfitting — training error also increased.
Residual Networks, introduced by He et al. from Microsoft Research, solved this problem with an elegantly simple idea: instead of learning direct mappings, learn residual functions with reference to the layer inputs, using identity skip connections. This single idea enabled training networks with hundreds — even thousands — of layers, and won ImageNet 2015 with a 3.57% top-5 error rate.
In this new 3-part "Build in Public" series, we will deconstruct Residual Networks from first principles. Today, we look at the mathematical foundation: the degradation problem, the residual learning formulation, and why skip connections enable gradient flow through arbitrarily deep networks. In Part 2, we will build the complete architecture in PyTorch. In Part 3, we will train on CIFAR-10 and visualize activation patterns.
The Degradation Problem
Consider stacking more layers onto a network that already works. Intuitively:
- Deeper networks have more representational capacity
- Additional layers could learn identity mappings if not needed
- Performance should improve (or at least not degrade)
But empirically, deep plain networks suffer from:
- Degradation: Training accuracy decreases with depth beyond ~20 layers
- Vanishing gradients: Gradients diminish through many multiplicative layers
- Optimization difficulty: The loss landscape becomes harder to navigate
The key insight is that this is not an overfitting problem — a 56-layer network has higher training error than a 20-layer one. The optimizer simply cannot find the solution, even though one at least as good provably exists by construction: copy the 20-layer network and let the extra layers compute the identity mapping.
The Residual Learning Insight
Instead of asking each block to learn a direct mapping $H(x)$, ResNets reframe the problem: learn a residual function

$$F(x) = H(x) - x$$

The layer output becomes:

$$H(x) = F(x) + x$$

where $F(x)$ is the residual (what the layer needs to add to the input) and $x$ is the identity (passed through unchanged via the skip connection).
Why Residual Learning Works
Easier to Learn Identity
If the optimal function is close to identity, residual learning makes this trivial:
- Plain network: Must learn $H(x) \approx x$ through nonlinear layers (non-trivial)
- Residual network: Learn $F(x) \approx 0$ (easy — push weights toward zero)
This reframing explains why deeper ResNets should never perform worse than shallower ones: the extra layers can simply learn zero residuals and fall back to the identity.
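A few lines of PyTorch make this concrete. The sketch below is illustrative only: `f` is a placeholder for the residual branch (the real block's $F$ is two convolutions, built in Part 2), and we zero its weights to show that a zero residual reduces the block to the identity.

```python
import torch
import torch.nn as nn

class ResidualBlockSketch(nn.Module):
    """Sketch of H(x) = F(x) + x, with `f` standing in for the residual branch F."""
    def __init__(self, f: nn.Module):
        super().__init__()
        self.f = f

    def forward(self, x):
        return self.f(x) + x  # residual branch plus identity skip

# Pushing F's weights to zero makes the block an exact identity mapping:
f = nn.Linear(4, 4)
nn.init.zeros_(f.weight)
nn.init.zeros_(f.bias)
block = ResidualBlockSketch(f)
x = torch.randn(2, 4)
identity_out = block(x)  # equals x, since F(x) = 0
```

A plain block would need its nonlinear layers to approximate the identity; here the identity is the default behavior.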
Gradient Highway
During backpropagation, the gradient through a residual block $y = x + F(x)$ is:

$$\frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial y}\left(1 + \frac{\partial F}{\partial x}\right)$$

The $+1$ term from the skip connection ensures gradients can always flow backward, even if $\frac{\partial F}{\partial x}$ vanishes. This is the same principle as the "Constant Error Carousel" in LSTMs — identity paths create gradient highways.
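A quick autograd check illustrates the effect (a toy scalar example of ours, not from the paper): with a linear residual branch $F(x) = wx$, the block computes $y = x + wx$, so the gradient is $1 + w$ rather than just $w$.

```python
import torch

w = 0.25
x = torch.tensor(2.0, requires_grad=True)
y = x + w * x   # residual block: identity path + residual branch
y.backward()
grad = x.grad.item()  # 1 + dF/dx = 1.25; stays near 1 even as w -> 0
```

Drop the identity path (`y = w * x`) and the gradient collapses to `w` alone, which is exactly what vanishes when such factors multiply across many layers.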
Ensemble Interpretation
Veit et al. (2016) showed that ResNets can be viewed as an ensemble of many paths of different lengths. Unraveling the skip connections reveals $2^n$ possible paths through $n$ blocks. Information can bypass layers through shortcuts, effectively creating an ensemble of shallower networks within a deep architecture.
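The unraveling argument can be made concrete with a tiny enumeration (our sketch): each block contributes either its residual branch or its identity shortcut, so expanding the nested additions yields $2^n$ distinct paths.

```python
from itertools import product

def unrolled_paths(n_blocks):
    """Enumerate the paths obtained by unrolling n residual blocks:
    each block contributes its residual branch ('F') or the identity
    shortcut ('id'), giving 2**n distinct paths."""
    return list(product(("F", "id"), repeat=n_blocks))

paths = unrolled_paths(3)
# a path's effective depth is the number of 'F' hops it takes
```

Most paths are much shorter than the full network, which is why Veit et al. found that the effective paths carrying gradient are relatively shallow.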
The Residual Block
A basic residual block consists of:
- Two $3 \times 3$ convolutional layers
- Batch normalization after each convolution
- ReLU activation after the first convolution
- Identity skip connection (element-wise addition)
- Final ReLU after the addition
When spatial dimensions or channel counts change between blocks, a $1 \times 1$ convolution with appropriate stride is used in the skip connection to match dimensions.
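Putting the pieces above together, a basic block might be sketched in PyTorch as follows. This is our own sketch, not the torchvision implementation; names, defaults, and the `shortcut` projection layout are assumptions, and Part 2 builds the full version.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Two 3x3 convs with BN, an identity (or 1x1-projected) skip,
    and ReLU after the addition."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        # 1x1 projection when spatial size or channel count changes
        self.shortcut = nn.Identity()
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + self.shortcut(x)  # the one-line difference from a plain block
        return self.relu(out)

block = BasicBlock(16, 32, stride=2)
y = block(torch.randn(1, 16, 8, 8))  # shape (1, 32, 4, 4)
```

Note that the second ReLU comes after the addition, matching the block structure described above.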
Comparison to Plain Networks
- Plain Network: $x \rightarrow \text{Conv} \rightarrow \text{BN} \rightarrow \text{ReLU} \rightarrow \text{Conv} \rightarrow \text{BN} \rightarrow \text{ReLU} \rightarrow y$
- Residual Network: $x \rightarrow \text{Conv} \rightarrow \text{BN} \rightarrow \text{ReLU} \rightarrow \text{Conv} \rightarrow \text{BN} \rightarrow (+x) \rightarrow \text{ReLU} \rightarrow y$
The only structural difference is the addition of $x$ before the final activation. This minimal change — literally one line of code — enables training networks $10\times$ deeper.
Next Steps: From Math to Code
We have established the mathematical foundation of residual learning: learning deviations from identity rather than direct mappings, creating gradient highways through skip connections.
In Part 2, we implement this exact formulation in pure PyTorch. We will build from a single ResidualBlock up through BottleneckBlocks, and construct the full ResNet-18/34/50/101/152 family plus a SmallResNet variant optimized for CIFAR-10.
Stay tuned for the code drop as we build ResNets from scratch!