Deconstructing CNNs from Scratch

Part 1: The Math of Convolutions

Introduction

Before Convolutional Neural Networks, computer vision relied on hand-crafted features — SIFT, HOG, and edge detectors designed by human experts. These methods were brittle, domain-specific, and failed to generalize.

CNNs changed everything by learning features directly from data. Instead of manually designing filters, CNNs learn them automatically through backpropagation, discovering hierarchical representations from edges to textures to object parts.

In this new 3-part "Build in Public" series, we will deconstruct Convolutional Neural Networks from first principles. Today, we look at the mathematical foundation: the convolution operation, pooling, and the architectural principles that make CNNs powerful. In Part 2, we will build complete CNN architectures in pure PyTorch. In Part 3, we will train on MNIST and visualize the learned filters and feature maps.

The Intuition Behind Convolution

Imagine sliding a small window (called a kernel or filter) across an image. At each position, you compute a weighted sum of pixel values. This operation detects specific patterns — vertical edges, horizontal edges, or color contrasts — depending on the kernel's weights.

The magic of CNNs is that these kernel weights are learned during training, not hand-designed.

The Convolution Operation

Mathematically, 2D convolution is defined as:

$$ (I * K)_{i,j} = \sum_m \sum_n I_{i-m, j-n} \cdot K_{m,n} $$

where $I$ is the input image and $K$ is the kernel.

In practice, deep learning libraries implement cross-correlation (no kernel flipping), which is simpler; since the kernel weights are learned anyway, the two operations are equivalent for training purposes:

$$ (I \star K)_{i,j} = \sum_m \sum_n I_{i+m, j+n} \cdot K_{m,n} $$
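The cross-correlation formula above translates almost line-for-line into code. Here is a minimal NumPy sketch (Part 2 moves to PyTorch proper) of a "valid", stride-1 version, applied with a hand-designed vertical-edge kernel of the kind CNNs learn automatically:

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """Slide the kernel over the image (no flipping) and take a
    weighted sum at each position -- 'valid' mode, stride 1."""
    H, W = image.shape
    k = kernel.shape[0]  # assume a square kernel
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

# A vertical-edge kernel responds where intensity changes
# from left to right.
image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=float)
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=float)
print(cross_correlate2d(image, kernel))  # [[3. 3.] [3. 3.]]
```

Every output position responds strongly (value 3) because the edge runs through the whole image; on a flat region the same kernel would output zero.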

For an input of height $H$ with a kernel of size $k$, the output height is (the width is computed analogously):

$$ H_{out} = \left\lfloor\frac{H + 2 \cdot \text{padding} - \text{dilation} \cdot (k - 1) - 1}{\text{stride}} + 1\right\rfloor $$
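This formula is worth internalizing, so here is a short helper that mirrors it term by term, checked against a few common configurations (28x28 is assumed here only because the series trains on MNIST in Part 3):

```python
def conv_output_size(size, k, stride=1, padding=0, dilation=1):
    """Output spatial size along one dimension; integer floor
    division mirrors the floor in the formula."""
    return (size + 2 * padding - dilation * (k - 1) - 1) // stride + 1

# 28x28 input, 3x3 kernel:
print(conv_output_size(28, 3))                       # 26: no padding shrinks the map
print(conv_output_size(28, 3, padding=1))            # 28: 'same' padding preserves size
print(conv_output_size(28, 3, stride=2, padding=1))  # 14: stride 2 halves it
```

Padding of $\lfloor k/2 \rfloor$ with stride 1 keeps the spatial size unchanged, which is why 3x3 convolutions with padding 1 are such a common building block.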

Key Architectural Principles

Local Connectivity

Unlike fully connected layers, convolutional neurons only connect to a local region of the input. This captures the intuition that nearby pixels are more related than distant ones. A single neuron in a convolutional layer sees only a small $k \times k$ patch, not the entire image.

Weight Sharing

The same kernel slides across the entire image. This dramatically reduces parameters and gives convolutions translation equivariance — a feature detected in one location can be detected anywhere, because the same weights are applied everywhere. A 3x3 convolution with 64 input and 64 output channels has only:

$$ 64 \times 64 \times 3 \times 3 = 36{,}864 \text{ parameters} $$

Compare this to a fully connected layer between the same tensors, which would need a weight for every input–output pair — millions or even billions of parameters, depending on the spatial size.
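The gap is easy to quantify. The arithmetic below assumes a modest 32x32 spatial resolution for the fully connected comparison (any concrete size would do; the point is the scaling):

```python
# Convolution: each of the 64 output channels has one 64x3x3
# kernel, shared across every spatial position (biases ignored).
conv_params = 64 * 64 * 3 * 3
print(conv_params)  # 36864

# A fully connected layer between the same tensors at 32x32
# resolution connects every input unit to every output unit.
units = 64 * 32 * 32
fc_params = units * units
print(f"{fc_params:,}")  # 4,294,967,296
```

Weight sharing turns a ~4-billion-parameter mapping into a ~37-thousand-parameter one, and the convolutional version works at any input resolution.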

Pooling

Pooling layers reduce spatial dimensions, providing:

- smaller feature maps, which cut computation and memory in later layers
- larger effective receptive fields for the layers that follow
- a degree of local translation invariance, since small shifts in the input often leave the pooled output unchanged

Max pooling, the most common variant, selects the maximum value in each window:

$$ \text{MaxPool}(x)_{i,j} = \max_{(m,n) \in \text{window}} x_{i+m, j+n} $$

Average pooling instead computes the mean, smoothing the representation rather than preserving the strongest activations.
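Both variants can be sketched in a few lines of NumPy. This toy version assumes non-overlapping windows (stride equal to the window size), which is the common 2x2 configuration:

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Non-overlapping pooling: stride equals the window size."""
    H, W = x.shape
    out = np.zeros((H // size, W // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = x[i * size:(i + 1) * size, j * size:(j + 1) * size]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

x = np.array([[1, 3, 2, 0],
              [4, 2, 1, 1],
              [0, 1, 5, 6],
              [2, 3, 7, 8]], dtype=float)
print(pool2d(x))                # max:  [[4. 2.] [3. 8.]]
print(pool2d(x, mode="mean"))   # mean: [[2.5 1. ] [1.5 6.5]]
```

Note how max pooling keeps only the strongest activation per window, while average pooling blends all four values.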

Hierarchical Feature Learning

Deep CNNs learn a hierarchy of features through stacked convolutional layers:

- early layers detect edges and color contrasts
- middle layers combine these into textures and simple shapes
- deeper layers respond to object parts and, ultimately, whole objects

This hierarchical representation is what makes CNNs so powerful for visual recognition. Each layer builds on the abstractions of the previous one, composing simple features into increasingly complex detectors.

Activation Functions

After each convolution, a non-linear activation function is applied. The ReLU (Rectified Linear Unit) is the default choice in modern CNNs:

$$ \text{ReLU}(x) = \max(0, x) $$

ReLU introduces non-linearity while avoiding the vanishing gradient problem that plagued earlier activations like sigmoid and tanh. It creates sparse activations — most neurons are inactive for any given input — which is both computationally efficient and may aid generalization.
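The sparsity claim is easy to check numerically. For roughly zero-mean pre-activations (simulated here with a seeded standard normal, an assumption for illustration), about half the units come out exactly zero:

```python
import numpy as np

def relu(x):
    """Element-wise max(0, x)."""
    return np.maximum(0.0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))  # [0.  0.  0.  1.5 3. ]

# Fraction of active (non-zero) units for zero-mean inputs:
rng = np.random.default_rng(0)
pre_act = rng.standard_normal(1000)
frac_active = (relu(pre_act) > 0).mean()
print(f"{frac_active:.0%} of units active")
```

Real networks are often sparser still, since biases and earlier ReLUs shift the pre-activation distribution.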

Next Steps: From Math to Code

The convolution operation, combined with weight sharing and hierarchical processing, provides CNNs with an inductive bias perfectly suited for visual data. These three principles — locality, translation equivariance, and compositionality — encode powerful assumptions about the structure of images.

In Part 2, we will implement these concepts in pure PyTorch. We will build a custom Conv2D layer, pooling operations, and complete CNN architectures including LeNet-5 and a VGG-style deep network.

Stay tuned for the code drop as we build CNNs from scratch!