Before CNNs, computer vision relied on hand-crafted features — SIFT, HOG, and edge detectors designed by human experts. These methods were brittle, domain-specific, and failed to generalize.
CNNs changed the game by learning features directly from data. Instead of manually designing filters, the network learns them through backpropagation, building up hierarchical representations from edges to textures to object parts.
This is part 1 of a 3-part series where I deconstruct CNNs from first principles. Here we cover the math: convolution, pooling, and the architectural principles that give CNNs their edge. Part 2 builds the architectures in PyTorch. Part 3 trains on MNIST and visualizes what the filters actually learn.
The Intuition Behind Convolution
Slide a small window (a kernel or filter) across an image. At each position, compute a weighted sum of pixel values. Depending on the kernel weights, this picks up vertical edges, horizontal edges, color contrasts, and so on.
The key insight: these kernel weights are learned during training, not hand-designed.
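To see this concretely, here is a small sketch (using SciPy purely for illustration) that applies a hand-crafted Sobel-style kernel to a toy image containing a single vertical edge:

```python
import numpy as np
from scipy.signal import correlate2d

# Toy 6x6 image: dark left half, bright right half -> one vertical edge.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Hand-crafted vertical-edge kernel (Sobel-style). In a CNN these nine
# weights would be learned by backpropagation instead of chosen by hand.
kernel = np.array([[1, 0, -1],
                   [2, 0, -2],
                   [1, 0, -1]])

response = correlate2d(image, kernel, mode="valid")
print(response)  # large-magnitude responses in the columns spanning the edge
```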
The Convolution Operation
Mathematically, 2D convolution is defined as:

$$(I * K)(i, j) = \sum_{m} \sum_{n} I(i - m,\, j - n)\, K(m, n)$$

where $I$ is the input image and $K$ is the kernel.
In practice, we use cross-correlation (no kernel flipping), which is simpler and learns equivalently:

$$(I \star K)(i, j) = \sum_{m} \sum_{n} I(i + m,\, j + n)\, K(m, n)$$
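As a sketch of what this computes, here is a naive NumPy implementation of the cross-correlation above (stride 1, no padding; the function name is mine). Real frameworks use heavily optimized kernels, but the arithmetic is the same:

```python
import numpy as np

def cross_correlate2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Naive 2D cross-correlation: stride 1, no padding."""
    k_h, k_w = kernel.shape
    out_h = image.shape[0] - k_h + 1
    out_w = image.shape[1] - k_w + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Weighted sum of the k_h x k_w patch under the kernel.
            out[i, j] = np.sum(image[i:i + k_h, j:j + k_w] * kernel)
    return out

print(cross_correlate2d(np.random.rand(28, 28), np.random.rand(3, 3)).shape)  # (26, 26)
```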
Output Dimensions
The spatial dimensions of the output depend on the kernel size $k$, stride $s$, and padding $p$. For an input of size $W$:

$$W_{\text{out}} = \left\lfloor \frac{W - k + 2p}{s} \right\rfloor + 1$$
A 28x28 MNIST image convolved with a 3x3 kernel at stride 1 and padding 1 produces a 28x28 output -- the spatial dimensions are preserved. Without padding, the output shrinks to 26x26 because the kernel cannot center on border pixels. Stacking several unpadded convolutions compounds this shrinkage, so most architectures use padding = kernel_size // 2 to maintain resolution until an explicit pooling layer halves it.
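The formula is easy to sanity-check in code (conv_out_size is a hypothetical helper, not a library call):

```python
def conv_out_size(w: int, k: int, s: int = 1, p: int = 0) -> int:
    """Output spatial size for input width w, kernel k, stride s, padding p."""
    return (w - k + 2 * p) // s + 1

print(conv_out_size(28, 3, s=1, p=1))  # 28 -- padding k//2 preserves size
print(conv_out_size(28, 3, s=1, p=0))  # 26 -- unpadded convolution shrinks
print(conv_out_size(28, 2, s=2))       # 14 -- a 2x2, stride-2 pool halves it
```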
Key Architectural Principles
Local Connectivity
Unlike fully connected layers, each convolutional neuron only sees a local region of the input — a small $k \times k$ patch, not the entire image. This encodes the prior that nearby pixels matter more than distant ones.
Weight Sharing
The same kernel slides across the entire image, which cuts parameters drastically and gives you translation equivariance: a feature detector works anywhere in the image, because shifting the input simply shifts the output feature map. For example, a 3x3 convolution with 64 input and 64 output channels has only:

$$3 \times 3 \times 64 \times 64 = 36{,}864$$

weights, plus 64 biases.
A fully connected layer connecting the same feature maps would need a weight for every input-output pair: even at a modest 7x7 spatial resolution, that is $(64 \cdot 7 \cdot 7)^2 \approx 9.8$ million weights.
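If you want to verify these counts, PyTorch will do it for you (the 7x7 resolution is just the illustrative choice from above):

```python
import torch.nn as nn

conv = nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, padding=1)
dense = nn.Linear(64 * 7 * 7, 64 * 7 * 7)  # same tensor, flattened, at 7x7 resolution

n_params = lambda m: sum(p.numel() for p in m.parameters())
print(n_params(conv))   # 36928 = 36,864 weights + 64 biases
print(n_params(dense))  # 9837632 -- roughly 266x more
```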
Pooling
Pooling layers reduce spatial dimensions, which saves computation, adds a degree of local translation invariance, and expands the receptive fields of deeper layers.
Max pooling selects the maximum value in each window:

$$y_{i,j} = \max_{(m, n) \in \mathcal{R}_{i,j}} x_{m,n}$$

where $\mathcal{R}_{i,j}$ is the pooling window for output position $(i, j)$.
Average pooling computes the mean instead, smoothing the representation. LeNet-5 (1998) originally used average pooling, but max pooling dominates modern architectures because it preserves the strongest activations and, during backpropagation, routes each window's gradient only to the element that produced the maximum.
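Both variants are a few lines of NumPy. A minimal sketch of non-overlapping pooling (stride equal to the window size, the common case; the function name is mine):

```python
import numpy as np

def max_pool2d(x: np.ndarray, size: int = 2) -> np.ndarray:
    """Non-overlapping max pooling: stride equals the window size."""
    # Trim so the window tiles the input exactly, then fold each
    # size x size block into its own pair of axes and reduce over them.
    h = (x.shape[0] // size) * size
    w = (x.shape[1] // size) * size
    blocks = x[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))  # swap .max for .mean -> average pooling

x = np.arange(16).reshape(4, 4)
print(max_pool2d(x))
# [[ 5  7]
#  [13 15]]
```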
Receptive Fields
The receptive field of a neuron is the region of the original input that influences its activation. In a single 3x3 convolutional layer, each output neuron sees a 3x3 patch. Stack two such layers and the effective receptive field grows to 5x5. Stack three and it reaches 7x7. Pooling layers accelerate this expansion -- a 2x2, stride-2 max pool doubles the rate at which the receptive field of every subsequent layer grows. This is why deep CNNs can capture global context despite using small local kernels: each layer aggregates information from a progressively wider region of the input.
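These numbers follow from a standard recurrence: each layer adds $(k - 1)$ times the product of all earlier strides to the receptive field. A small helper (my own, not a library function) makes it easy to experiment:

```python
def receptive_field(layers) -> int:
    """Effective receptive field after a stack of (kernel_size, stride) layers."""
    rf, jump = 1, 1  # jump: input pixels between adjacent outputs at this depth
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

print(receptive_field([(3, 1)]))                  # 3 -- one 3x3 conv
print(receptive_field([(3, 1), (3, 1)]))          # 5 -- two stacked
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7 -- three stacked
print(receptive_field([(3, 1), (2, 2), (3, 1)]))  # 8 -- a 2x2 pool doubles later growth
```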
Hierarchical Feature Learning
Stacked convolutional layers learn a hierarchy of features:
- Early layers: Edges, corners, simple textures
- Middle layers: Combinations of edges into motifs and patterns
- Deep layers: Object parts and semantic concepts
Each layer composes the abstractions of the previous one. This compositionality is what makes CNNs effective for visual recognition. A network does not need to learn a "7" detector from scratch -- it learns vertical strokes and horizontal bars in early layers, then composes them into angle and intersection detectors, which together fire for the digit 7. This compositional reuse is also why CNNs generalize across variations in handwriting style, scale, and position.
What Comes Next
Convolution plus weight sharing plus hierarchical processing gives CNNs an inductive bias well-matched to visual data: locality, translation equivariance, and compositionality.
Part 2 implements all of this in PyTorch — custom Conv2D layers, pooling, and full architectures including LeNet-5 and a VGG-style deep network.