In 2015, He et al. at Microsoft Research noticed something that shouldn't happen: a 56-layer convolutional network had higher training error than a 20-layer one. Not higher test error -- higher training error. This wasn't overfitting. The optimizer was simply failing to find a solution that provably existed.
Their fix was to stop asking layers to learn full mappings and instead have them learn residuals -- small corrections added to an input that passes through unchanged via identity skip connections. That single change enabled networks with hundreds of layers, and an ensemble of them won the ILSVRC 2015 classification task at 3.57% top-5 error.
This is Part 1 of a 3-part series where I build ResNets from scratch. Here we cover the math: the degradation problem, the residual formulation, and why skip connections fix gradient flow. Part 2 is the PyTorch implementation. Part 3 is training on CIFAR-10 with activation visualizations.
The Degradation Problem
Consider stacking more layers onto a network that already works. You'd expect:
- More layers = more representational capacity
- Extra layers could just learn identity mappings if they aren't needed
- Performance should at least stay the same
In practice, deep plain networks hit three problems:
- Degradation: Training accuracy gets worse as depth grows (He et al. saw it going from 20 to 56 layers)
- Vanishing gradients: Gradients shrink through many multiplicative layers
- Optimization difficulty: The loss landscape becomes harder to navigate
The important point: this is not overfitting. A 56-layer plain network has higher training error than a 20-layer one. The solution exists (a 56-layer net could just copy the 20-layer solution and set the extra 36 layers to identity), but SGD cannot find it.
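The "solution exists" argument can be checked by construction. Here is a minimal sketch in plain Python, assuming tiny fully connected ReLU layers stand in for convolutions (the helper names `relu_layer`, `identity_weights`, and `forward` are made up for illustration). Appending identity-weight layers to a shallow network leaves its output unchanged, because ReLU acts as the identity on the non-negative activations the previous ReLU produced.

```python
def relu_layer(weights, x):
    """One fully connected layer followed by ReLU: relu(W @ x)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]

def identity_weights(n):
    """n x n identity matrix as nested lists."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def forward(layers, x):
    """Apply each layer in order."""
    for weights in layers:
        x = relu_layer(weights, x)
    return x

# A "shallow" 2-layer network with arbitrary illustrative weights.
shallow = [
    [[0.5, -1.0], [1.5, 0.25]],
    [[1.0, 2.0], [-0.5, 1.0]],
]
# A "deep" network: the same 2 layers plus 36 identity layers.
deep = shallow + [identity_weights(2) for _ in range(36)]

x = [1.0, 2.0]
shallow_out = forward(shallow, x)
deep_out = forward(deep, x)
print(shallow_out, deep_out)
```

Both networks print the same output. The deeper one is never worse by construction; the degradation problem is that gradient-based training does not land on weights like these on its own.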
The Residual Learning Insight
Instead of having each block learn a direct mapping $H(x)$, ResNets reframe the problem. Learn a residual function:
$$F(x) = H(x) - x$$
So the layer output is:
$$H(x) = F(x) + x$$
$F(x)$ is what the layer needs to add to the input. $x$ passes through unchanged via the skip connection.
Why Residual Learning Works
Easier to Learn Identity
If the optimal function is close to identity, the two formulations are not equally easy to optimize:
- Plain network: Must learn $H(x) \approx x$ through nonlinear layers (non-trivial)
- Residual network: Learn $F(x) \approx 0$ (easy -- push weights toward zero)
This is why a deeper ResNet should, in principle, train no worse than a shallower one: unnecessary blocks can learn near-zero residuals and pass the input through.
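To make the contrast concrete, here is a small sketch in plain Python. It assumes a two-layer fully connected transform in place of the convolutions and leaves out the final ReLU, and the names `transform`, `plain_block`, and `residual_block` are illustrative only. With all weights at zero, the plain block wipes out its input while the residual block returns it untouched.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def matvec(weights, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def transform(w1, w2, x):
    """Two-layer transform: W2 @ relu(W1 @ x)."""
    return matvec(w2, relu(matvec(w1, x)))

def plain_block(w1, w2, x):
    # The block must produce H(x) directly.
    return transform(w1, w2, x)

def residual_block(w1, w2, x):
    # The block produces F(x); the skip connection adds x back.
    return [f + xi for f, xi in zip(transform(w1, w2, x), x)]

zeros = [[0.0, 0.0], [0.0, 0.0]]
x = [3.0, -1.0]

plain_out = plain_block(zeros, zeros, x)        # collapses to zeros
residual_out = residual_block(zeros, zeros, x)  # passes x through
print(plain_out, residual_out)
```

Zero weights are the point weight decay and small initializations already pull toward, so for the residual block "do nothing" sits right next to the starting point rather than somewhere the optimizer has to search for.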
Gradient Highway
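During backpropagation, write the block output as $y = F(x) + x$ and the loss as $\mathcal{L}$. Ignoring the final ReLU, the gradient through a residual block is:
$$\frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial y} \left( \frac{\partial F}{\partial x} + 1 \right)$$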
That $+1$ from the skip connection is the whole trick. Even if $\frac{\partial F}{\partial x}$ vanishes, gradients still flow backward through the identity path. Same principle as the Constant Error Carousel in LSTMs.
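A quick numerical sketch of that effect, in plain Python. It assumes scalar blocks with $F(x) = w \tanh(x)$ and a deliberately small weight; `W`, `DEPTH`, and `input_gradient` are illustrative names, not anything from the paper. The function multiplies the local derivatives of 50 stacked blocks, with and without the skip path.

```python
import math

W = 0.01      # illustrative small weight, so dF/dx is close to zero
DEPTH = 50    # number of stacked blocks

def f(x):
    return W * math.tanh(x)

def df(x):
    return W * (1.0 - math.tanh(x) ** 2)

def input_gradient(x, residual):
    """d(output)/d(input) through DEPTH blocks, via the chain rule."""
    grad = 1.0
    for _ in range(DEPTH):
        local = df(x) + (1.0 if residual else 0.0)  # the +1 is the skip path
        grad *= local
        x = f(x) + (x if residual else 0.0)         # forward to the next block
    return grad

plain_grad = input_gradient(0.5, residual=False)
residual_grad = input_gradient(0.5, residual=True)
print(f"plain:    {plain_grad:.3e}")
print(f"residual: {residual_grad:.3e}")
```

In the plain stack every factor is at most 0.01, so the product shrinks to roughly the order of 1e-100. In the residual stack every factor sits between 1 and 1.01, so the product stays of order one.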
Ensemble Interpretation
Veit et al. (2016) showed that unraveling the skip connections reveals $2^n$ possible paths through $n$ blocks. A ResNet effectively operates as an ensemble of many shallower networks of different depths -- information can bypass any subset of layers through the shortcuts.
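The path count is easy to verify for a toy case. The sketch below, in plain Python, enumerates the $2^n$ paths for $n = 3$ and checks the unraveled view against a stacked forward pass. It assumes scalar linear blocks $y = x + a_i x$, which is a simplification: for real nonlinear blocks the unraveling is an interpretation rather than an exact identity. The coefficients in `a` are arbitrary.

```python
from itertools import product
from collections import Counter

n = 3
# Each block is either skipped (0) or traversed (1): 2**n paths in total.
paths = list(product([0, 1], repeat=n))
length_counts = Counter(sum(path) for path in paths)

# Scalar linear residual blocks y = x + a_i * x (an assumption for illustration).
a = [0.5, -0.25, 2.0]
x = 1.0

# Stacked forward pass through the three blocks.
stacked = x
for a_i in a:
    stacked = stacked + a_i * stacked

# Sum over all paths: each path multiplies the a_i of the blocks it traverses.
unraveled = 0.0
for path in paths:
    term = x
    for used, a_i in zip(path, a):
        if used:
            term *= a_i
    unraveled += term

print(len(paths), dict(length_counts), stacked, unraveled)
```

The path lengths follow a binomial distribution, so most paths are of middling length and only one passes through every block.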
The Residual Block
A basic residual block consists of:
- Two $3 \times 3$ convolutional layers
- Batch normalization after each convolution
- ReLU activation after the first convolution
- Identity skip connection (element-wise addition)
- Final ReLU after the addition
When spatial dimensions or channel counts change between blocks, a $1 \times 1$ convolution with appropriate stride handles the dimension mismatch in the skip path.
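The projection shortcut can be sketched without any framework. The following plain-Python example assumes a feature map stored as nested lists in `[C][H][W]` order and leaves out bias and any normalization; `conv1x1`, `shape`, and the `projection` weights are hypothetical and chosen only to show the shapes lining up.

```python
def conv1x1(x, weights, stride):
    """1x1 convolution (no bias) on a [C][H][W] nested-list feature map.

    weights has shape [C_out][C_in]; each output pixel is a linear mix of
    the input channels at one (strided) spatial location.
    """
    c_in, height, width = len(x), len(x[0]), len(x[0][0])
    return [
        [
            [sum(row[c] * x[c][i][j] for c in range(c_in))
             for j in range(0, width, stride)]
            for i in range(0, height, stride)
        ]
        for row in weights
    ]

def shape(t):
    return (len(t), len(t[0]), len(t[0][0]))

# Input: 2 channels, 4x4 spatial. The main path of the block is assumed to
# output 4 channels at 2x2, so the shortcut must match that shape.
x = [[[float(c + i + j) for j in range(4)] for i in range(4)] for c in range(2)]
projection = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, -1.0]]  # illustrative weights

shortcut = conv1x1(x, projection, stride=2)
print(shape(x), "->", shape(shortcut))
```

The stride handles the spatial downsampling and the weight matrix handles the channel change, so the shortcut output can be added element-wise to the main path.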
Comparison to Plain Networks
- Plain Network: $x \rightarrow \text{Conv} \rightarrow \text{BN} \rightarrow \text{ReLU} \rightarrow \text{Conv} \rightarrow \text{BN} \rightarrow \text{ReLU} \rightarrow y$
- Residual Network: $x \rightarrow \text{Conv} \rightarrow \text{BN} \rightarrow \text{ReLU} \rightarrow \text{Conv} \rightarrow \text{BN} \rightarrow (+x) \rightarrow \text{ReLU} \rightarrow y$
The only structural difference is adding $x$ before the final activation. One line of code. That's what takes you from 20-layer ceilings to 1000+ layer networks.
Next: From Math to Code
That covers the mathematical foundation -- residual functions, gradient highways, and the ensemble view. Part 2 turns all of this into a working PyTorch implementation: ResidualBlock, BottleneckBlock, the full ResNet-18/34/50/101/152 family, and a SmallResNet variant sized for CIFAR-10.