Deconstructing ResNets: Part 1 - The Math of Residual Learning

Introduction

In 2015, He et al. at Microsoft Research noticed something that shouldn't happen: a 56-layer convolutional network had higher training error than a 20-layer one. Not higher test error -- higher training error. This wasn't overfitting. The optimizer was simply failing to find a solution that provably existed.

Their fix was to stop asking layers to learn full mappings and instead have them learn residuals -- small corrections to the input, which is itself passed through via identity skip connections. That single change made networks with more than a hundred layers trainable and won the ILSVRC 2015 classification task with 3.57% top-5 error.

This is Part 1 of a 3-part series where I build ResNets from scratch. Here we cover the math: the degradation problem, the residual formulation, and why skip connections fix gradient flow. Part 2 is the PyTorch implementation. Part 3 is training on CIFAR-10 with activation visualizations.

The Degradation Problem

Consider stacking more layers onto a network that already works. You'd expect:

More layers = more representational capacity
Extra layers could just learn identity mappings if they aren't needed
Performance should at least stay the same

In practice, deep plain networks hit three problems:

Degradation: As depth grows, training accuracy saturates and then drops (past ~20 layers in the plain networks discussed here)
Vanishing gradients: Gradients shrink through many multiplicative layers
Optimization difficulty: The loss landscape becomes harder to navigate

The important point: this is not overfitting. A 56-layer plain network has higher training error than a 20-layer one. A solution at least as good exists by construction (a 56-layer net could copy the 20-layer solution and set the extra 36 layers to identity), but SGD fails to find it in practice.

The Residual Learning Insight

Instead of having each block learn a direct mapping $H(x)$, ResNets reframe the problem. Learn a residual function:

$$ F(x) = H(x) - x $$

So the layer output is:

$$ y = F(x) + x $$

$F(x)$ is what the layer needs to add to the input. $x$ passes through unchanged via the skip connection.
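The two equations above can be sketched in a few lines of plain Python. This is a toy illustration, not the block from the paper: $F$ is a tiny two-layer function on a vector rather than a stack of convolutions, and the names `residual_branch`, `residual_block`, and the weight values are assumptions made up for the example.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]


def linear(weights, v):
    """Matrix-vector product; `weights` is a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]


def residual_branch(x, w1, w2):
    """F(x): the correction the block learns (toy two-layer function)."""
    return linear(w2, relu(linear(w1, x)))


def residual_block(x, w1, w2):
    """y = F(x) + x: the skip connection adds the input back unchanged."""
    fx = residual_branch(x, w1, w2)
    return [f + a for f, a in zip(fx, x)]


# Hypothetical toy weights, chosen only to keep the arithmetic easy to follow.
w1 = [[1.0, 0.0], [0.0, 1.0]]
w2 = [[0.5, 0.0], [0.0, 0.5]]
x = [2.0, -4.0]

fx = residual_branch(x, w1, w2)   # F(x)
y = residual_block(x, w1, w2)     # H(x) = F(x) + x
print("F(x) =", fx)
print("y    =", y)
```

Subtracting `x` from `y` element by element recovers `fx`, which is the definition $F(x) = H(x) - x$ read backwards.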

Why Residual Learning Works

Easier to Learn Identity

If the optimal function is close to identity, the two formulations are not equally easy to optimize:

Plain network: Must learn $H(x) \approx x$ through nonlinear layers (non-trivial)
Residual network: Learn $F(x) \approx 0$ (easy -- push weights toward zero)

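This is why a deeper ResNet should, in principle, train at least as well as a shallower one: unnecessary blocks can learn near-zero residuals and pass their input through. It is an argument about optimization, not a guarantee about test performance; very deep ResNets can still overfit.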
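A toy comparison makes the asymmetry concrete. The sketch below uses scalar "layers" with a single weight each, which is an assumption for illustration only; `plain_layer`, `residual_layer`, and `run_stack` are hypothetical names, and the final ReLU after the addition is left out to keep the example minimal. The depth of 36 echoes the extra layers in the 56-versus-20 comparison.

```python
def relu(a):
    return max(0.0, a)


def plain_layer(x, w):
    """Plain layer: the output is the full mapping H(x) = relu(w * x)."""
    return relu(w * x)


def residual_layer(x, w):
    """Residual layer: the output is F(x) + x with F(x) = relu(w * x)."""
    return relu(w * x) + x


def run_stack(layer, x, weights):
    """Feed x through one layer per weight, in order."""
    for w in weights:
        x = layer(x, w)
    return x


extra_layers = 36
zero_weights = [0.0] * extra_layers
x = 1.5

plain_out = run_stack(plain_layer, x, zero_weights)        # signal is wiped out
residual_out = run_stack(residual_layer, x, zero_weights)  # input passes through
print(plain_out, residual_out)
```

With every weight at zero, the plain stack destroys the input while the residual stack returns it untouched. In this scalar toy the plain layer would need `w = 1` exactly (and a non-negative input) to act as identity, whereas the residual layer needs `w = 0`, which is the value that weight decay and small initializations already pull toward.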

Gradient Highway

During backpropagation, the gradient through a residual block is:

$$ \frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \cdot \left(\frac{\partial F}{\partial x} + 1\right) $$

That $+1$ from the skip connection is the whole trick (for vector-valued $x$ it becomes the identity matrix). Even if $\frac{\partial F}{\partial x}$ vanishes, gradients still flow backward through the identity path. This is the same principle as the Constant Error Carousel in LSTMs.
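The gradient formula can be checked numerically on a scalar toy block. In the sketch below, the branch $F(x) = w \tanh(x)$, the helper names, and the assumption that every layer in a deep stack has the same small local derivative (0.01) are all simplifications chosen for illustration; real networks have Jacobians that vary by layer and by input.

```python
import math


def f(x, w):
    """Toy residual branch F(x) = w * tanh(x)."""
    return w * math.tanh(x)


def df_dx(x, w):
    """Analytic derivative of the branch: dF/dx = w * (1 - tanh(x)^2)."""
    return w * (1.0 - math.tanh(x) ** 2)


def block(x, w):
    """Residual block y = F(x) + x."""
    return f(x, w) + x


# 1) One block: compare dy/dx = dF/dx + 1 against a central finite difference.
x0, w0, eps = 0.3, 0.7, 1e-6
numeric = (block(x0 + eps, w0) - block(x0 - eps, w0)) / (2 * eps)
analytic = df_dx(x0, w0) + 1.0


# 2) Many blocks: chain the per-layer factor from the top of the stack down.
def gradient_through_stack(local_grad, depth, skip):
    """Multiply dL/dy = 1 by one factor per layer (assumed identical per layer)."""
    grad = 1.0
    for _ in range(depth):
        grad *= (local_grad + 1.0) if skip else local_grad
    return grad


plain_grad = gradient_through_stack(0.01, depth=50, skip=False)
residual_grad = gradient_through_stack(0.01, depth=50, skip=True)
print(numeric, analytic)
print(plain_grad, residual_grad)
```

Under these assumptions the plain stack multiplies fifty factors of 0.01 and the gradient is numerically gone, while the residual stack multiplies fifty factors of 1.01 and the gradient stays on the order of 1.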

Ensemble Interpretation

Veit et al. (2016) showed that unraveling the skip connections reveals $2^n$ possible paths through $n$ blocks. A ResNet effectively operates as an ensemble of many shallower networks of different depths -- information can bypass any subset of layers through the shortcuts.
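The path count is easy to reproduce by brute force. The sketch below enumerates every skip-or-transform choice for three blocks and then checks the unraveling algebraically for the special case of linear scalar blocks, where $y = \prod_i (1 + w_i)\,x$ expands into one term per path. That exact sum only holds for linear blocks, which is an assumption made here so the identity can be verified; with nonlinearities the paths are routes through the computation graph rather than additive terms. The function name and weights are hypothetical.

```python
import math
from collections import Counter
from itertools import product


def enumerate_paths(n_blocks):
    """Each path is a tuple of flags: 1 = go through block i's F, 0 = take its skip."""
    return list(product((0, 1), repeat=n_blocks))


paths = enumerate_paths(3)
# Path length = number of blocks actually traversed; counts follow C(n, k).
length_counts = Counter(sum(p) for p in paths)

# Linear scalar blocks: stacking gives a product, unraveling gives a sum over paths.
weights = [0.5, -0.25, 2.0]
stacked = math.prod(1.0 + w for w in weights)
unraveled = sum(
    math.prod(w for w, used in zip(weights, p) if used)  # empty product is 1
    for p in paths
)
print(len(paths), dict(length_counts))
print(stacked, unraveled)
```

Most of the paths have intermediate length: for three blocks, six of the eight paths traverse one or two blocks, and only one traverses all three.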

The Residual Block

A basic residual block consists of:

Two $3 \times 3$ convolutional layers
Batch normalization after each convolution
ReLU activation after the first convolution's batch normalization
Identity skip connection (element-wise addition)
Final ReLU after the addition

When spatial dimensions or channel counts change between blocks, a $1 \times 1$ convolution with appropriate stride handles the dimension mismatch in the skip path.
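The dimension mismatch, and why a strided $1 \times 1$ convolution resolves it, comes down to shape arithmetic. The sketch below uses the standard convolution output-size formula and assumes padding 1 on the $3 \times 3$ convolutions, the stride placed on the first of them, and a hypothetical transition from 64 channels at $32 \times 32$ to 128 channels with stride 2. None of these specifics come from the text above.

```python
def conv_out_size(size, kernel, stride, padding):
    """Spatial output size of a convolution: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


def main_path_shape(channels_out, size, stride):
    """Two 3x3 convs with padding 1; only the first one carries the stride (assumed)."""
    size = conv_out_size(size, kernel=3, stride=stride, padding=1)
    size = conv_out_size(size, kernel=3, stride=1, padding=1)
    return (channels_out, size, size)


def projection_shape(channels_out, size, stride):
    """1x1 conv on the skip path with the same stride and no padding."""
    size = conv_out_size(size, kernel=1, stride=stride, padding=0)
    return (channels_out, size, size)


# Hypothetical stage transition: 64 channels at 32x32 -> 128 channels, stride 2.
identity = (64, 32, 32)                       # what an unprojected skip would carry
main = main_path_shape(128, 32, stride=2)     # output of the two 3x3 convs
skip = projection_shape(128, 32, stride=2)    # output of the 1x1 projection
print(identity, main, skip)
```

The unprojected input cannot be added to the main path because both its channel count and its spatial size differ; the projected skip matches on every axis, so the element-wise addition is well defined.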

Comparison to Plain Networks

Plain Network: $x \rightarrow \text{Conv} \rightarrow \text{BN} \rightarrow \text{ReLU} \rightarrow \text{Conv} \rightarrow \text{BN} \rightarrow \text{ReLU} \rightarrow y$
Residual Network: $x \rightarrow \text{Conv} \rightarrow \text{BN} \rightarrow \text{ReLU} \rightarrow \text{Conv} \rightarrow \text{BN} \rightarrow (+x) \rightarrow \text{ReLU} \rightarrow y$

The only structural difference is adding $x$ before the final activation. One line of code. That's what takes you from 20-layer ceilings to 1000+ layer networks.

Next: From Math to Code

That covers the mathematical foundation -- residual functions, gradient highways, and the ensemble view. Part 2 turns all of this into a working PyTorch implementation: ResidualBlock, BottleneckBlock, the full ResNet-18/34/50/101/152 family, and a SmallResNet variant sized for CIFAR-10.
