In Part 1, we established why residual learning works — learning deviations from identity creates gradient highways that enable training of arbitrarily deep networks.
Today, we take that math and translate it into a pure PyTorch implementation. We will build from individual residual blocks up to the full ResNet-18/34/50/101/152 family, plus a SmallResNet variant optimized for CIFAR-10.
The Residual Block
Our ResidualBlock implements the core $y = F(x) + x$ pattern: two convolutions with a skip connection that bypasses both. The critical detail: ReLU comes after the addition, not before — this ensures the skip connection passes through cleanly.
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    expansion = 1  # Output channels = out_channels * expansion

    def __init__(self, in_channels, out_channels, stride=1, downsample=None):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
                               stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.downsample = downsample

    def forward(self, x):
        identity = x
        out = F.relu(self.bn1(self.conv1(x)))  # 3x3 conv -> BN -> ReLU
        out = self.bn2(self.conv2(out))        # 3x3 conv -> BN (no ReLU yet)
        if self.downsample is not None:
            identity = self.downsample(x)
        out += identity                        # Skip connection!
        out = F.relu(out)                      # Final ReLU after addition
        return out
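A quick sanity check makes the post-addition ReLU concrete: if the residual branch outputs zero (here we zero-init a standalone conv as a stand-in for $F$), then $y = F(x) + x$ collapses to $\mathrm{ReLU}(x)$, so the block starts out as a near-identity mapping. This is a minimal sketch of the pattern, not the full block:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
conv = nn.Conv2d(16, 16, kernel_size=3, padding=1, bias=False)
nn.init.zeros_(conv.weight)  # residual branch F(x) = 0

x = torch.randn(2, 16, 8, 8)
out = F.relu(conv(x) + x)    # post-activation: ReLU applied AFTER the addition

# With F(x) = 0, the block reduces to ReLU(x)
assert torch.allclose(out, F.relu(x))
```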
Handling Dimension Changes
When spatial dimensions or channel counts change between layers, the identity shortcut needs a $1 \times 1$ convolution to match:
# Create downsampling for skip connection if needed
if stride != 1 or self.in_channels != out_channels * block.expansion:
    downsample = nn.Sequential(
        nn.Conv2d(self.in_channels, out_channels * block.expansion,
                  kernel_size=1, stride=stride, bias=False),
        nn.BatchNorm2d(out_channels * block.expansion),
    )
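To see the projection in action, here is a small shape check for a typical stage transition (64 → 128 channels with stride 2). The 1x1 conv doubles the channels while the stride halves the spatial resolution, so the shortcut's output matches the main branch:

```python
import torch
import torch.nn as nn

# Stage transition: 64 -> 128 channels, stride 2 halves spatial size
downsample = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=1, stride=2, bias=False),
    nn.BatchNorm2d(128),
)

x = torch.randn(1, 64, 32, 32)
print(downsample(x).shape)  # torch.Size([1, 128, 16, 16])
```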
Bottleneck Blocks
For deeper ResNets (50+), direct $3 \times 3$ convolutions become computationally expensive at higher channel counts. Bottleneck blocks solve this with a $1 \times 1 \rightarrow 3 \times 3 \rightarrow 1 \times 1$ pattern:
- $1 \times 1$ conv: Reduce channels (e.g., 256 → 64) — "squeeze"
- $3 \times 3$ conv: Process at reduced channels — the expensive operation
- $1 \times 1$ conv: Restore channels (64 → 256) — "expand"
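A back-of-envelope weight count (biases and BN parameters ignored) shows why this matters. Comparing two direct 3x3 convs at 256 channels against the 256 → 64 → 64 → 256 bottleneck:

```python
# Weight counts only (biases and BN params ignored)
direct = 2 * (3 * 3 * 256 * 256)   # two 3x3 convs at 256 channels
bottleneck = (1 * 1 * 256 * 64     # 1x1 squeeze: 256 -> 64
              + 3 * 3 * 64 * 64    # 3x3 at reduced width
              + 1 * 1 * 64 * 256)  # 1x1 expand: 64 -> 256

print(direct, bottleneck, round(direct / bottleneck, 1))
# 1179648 69632 16.9
```

Roughly 17x fewer weights for the same input/output width, which is what makes 100+ layer ResNets affordable.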
class BottleneckBlock(nn.Module):
    expansion = 4  # Output channels = out_channels * 4

    def __init__(self, in_channels, out_channels, stride=1, downsample=None):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=1,
                               stride=1, bias=False)  # Squeeze
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)  # Process
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.conv3 = nn.Conv2d(out_channels, out_channels * self.expansion,
                               kernel_size=1, stride=1, bias=False)  # Expand
        self.bn3 = nn.BatchNorm2d(out_channels * self.expansion)
        self.downsample = downsample

    def forward(self, x):
        identity = x
        out = F.relu(self.bn1(self.conv1(x)))    # 1x1 squeeze
        out = F.relu(self.bn2(self.conv2(out)))  # 3x3 process
        out = self.bn3(self.conv3(out))          # 1x1 expand (no ReLU)
        if self.downsample is not None:
            identity = self.downsample(x)
        out += identity
        return F.relu(out)
The ResNet Family
All ResNets share the same macro-architecture but differ in block type and depth:
ResNet-18/34 (Basic Blocks)
ResNet-18: [2, 2, 2, 2] basic blocks # 11.7M params
ResNet-34: [3, 4, 6, 3] basic blocks # 21.8M params
ResNet-50/101/152 (Bottleneck Blocks)
ResNet-50: [3, 4, 6, 3] bottleneck # 25.6M params
ResNet-101: [3, 4, 23, 3] bottleneck # 44.5M params
ResNet-152: [3, 8, 36, 3] bottleneck # 60.2M params
SmallResNet for CIFAR-10
For smaller $32 \times 32$ images, the standard ResNet stem ($7 \times 7$ conv + maxpool) would reduce spatial dimensions too aggressively. Our SmallResNet adapts:
- Replaces $7 \times 7$ conv with $3 \times 3$ conv (stride=1)
- Removes the initial max pooling layer
- Uses 3 residual stages instead of 4
- Starts with 16 channels instead of 64
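The stem change is easy to verify with a shape check: the standard ImageNet stem leaves only an 8x8 feature map from a 32x32 input before any residual stage runs, while the 3x3 stride-1 stem preserves the full resolution:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)

# Standard ImageNet stem: 7x7/2 conv + 3x3/2 maxpool
stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)
print(stem(x).shape)        # torch.Size([1, 64, 8, 8])

# CIFAR stem: 3x3/1 conv, no pooling
small_stem = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1, bias=False)
print(small_stem(x).shape)  # torch.Size([1, 16, 32, 32])
```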
class SmallResNet(nn.Module):
    def __init__(self, block, blocks_per_layer, num_classes=10,
                 in_channels=3, initial_channels=16):
        super().__init__()
        self.in_channels = initial_channels
        # 3x3 stem (no maxpool for small images)
        self.conv1 = nn.Conv2d(in_channels, initial_channels, kernel_size=3,
                               stride=1, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(initial_channels)
        # 3 residual stages (not 4)
        self.layer1 = self._make_layer(block, initial_channels, blocks_per_layer[0], stride=1)
        self.layer2 = self._make_layer(block, initial_channels * 2, blocks_per_layer[1], stride=2)
        self.layer3 = self._make_layer(block, initial_channels * 4, blocks_per_layer[2], stride=2)
        # Global average pooling + classifier
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(initial_channels * 4 * block.expansion, num_classes)
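The snippet above omits the forward tail, so here is a sketch of the assumed pooling-to-classifier path: global average pooling reduces each channel to a single value, a flatten turns the result into a feature vector, and the linear layer maps it to class logits. Shown with the shapes this SmallResNet produces after stage 3 (16 * 4 = 64 channels with basic blocks):

```python
import torch
import torch.nn as nn

feat = torch.randn(4, 64, 8, 8)         # stage-3 output: batch of 4, 64 channels
pool = nn.AdaptiveAvgPool2d((1, 1))
fc = nn.Linear(64, 10)

out = fc(torch.flatten(pool(feat), 1))  # (4, 64, 1, 1) -> (4, 64) -> (4, 10)
print(out.shape)  # torch.Size([4, 10])
```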
Weight Initialization
We use He (Kaiming) initialization for all convolutions — this is critical for ResNets. Standard Xavier initialization underestimates the variance needed after ReLU (which zeros half the distribution):
def _initialize_weights(self):
    for m in self.modules():
        if isinstance(m, nn.Conv2d):
            nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
        elif isinstance(m, nn.BatchNorm2d):
            nn.init.constant_(m.weight, 1)
            nn.init.constant_(m.bias, 0)
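The variance gap is easy to measure directly. For a 64 → 64 3x3 conv, fan_out = 64 * 9 = 576, so Kaiming targets a weight std of sqrt(2 / 576) ≈ 0.059, while Xavier targets sqrt(2 / (fan_in + fan_out)) ≈ 0.042, a factor of sqrt(2) smaller:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False)

nn.init.kaiming_normal_(conv.weight, mode='fan_out', nonlinearity='relu')
kaiming_std = conv.weight.std().item()   # ~ sqrt(2 / 576) = 0.059

nn.init.xavier_normal_(conv.weight)
xavier_std = conv.weight.std().item()    # ~ sqrt(2 / 1152) = 0.042

print(round(kaiming_std, 4), round(xavier_std, 4))
```

The extra factor compensates for ReLU zeroing roughly half of each layer's activations.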
Training Strategy
We train our SmallResNet-18 on CIFAR-10 with:
- Data augmentation: Random horizontal flip + random crop with 4px padding
- SGD with momentum (0.9) and weight decay ($5 \times 10^{-4}$)
- Learning rate scheduling: Start at 0.1, decay $\times 0.1$ at epochs 15 and 25
- Batch size: 128
model = SmallResNet18(num_classes=10, in_channels=3).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[15, 25], gamma=0.1)
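Before launching a full run, a one-batch smoke test of this setup is cheap and catches wiring mistakes. The model here is a tiny placeholder standing in for SmallResNet18, and the loop fakes 30 "epochs" on a single random batch to confirm the MultiStepLR milestones fire (0.1 → 0.01 at epoch 15 → 0.001 at epoch 25):

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
# Tiny stand-in model (placeholder for SmallResNet18)
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10))
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[15, 25], gamma=0.1)

images = torch.randn(8, 3, 32, 32)       # fake CIFAR-10 batch
labels = torch.randint(0, 10, (8,))

for epoch in range(30):
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    scheduler.step()  # stepped once per epoch, not per batch

print(scheduler.get_last_lr())  # ~1e-3 after both milestones
```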
Next Steps: Training Deep Networks
With our ResNet implemented, we can now train and visualize how skip connections enable gradient flow through deep architectures. In Part 3, we will benchmark on CIFAR-10, visualize activation statistics layer-by-layer, and inspect learned feature maps.