Before Convolutional Neural Networks, computer vision relied on hand-crafted features — SIFT, HOG, and edge detectors designed by human experts. These methods were brittle, domain-specific, and failed to generalize.
CNNs changed everything by learning features directly from data. Instead of manually designing filters, CNNs learn them automatically through backpropagation, discovering hierarchical representations from edges to textures to object parts.
In this new 3-part "Build in Public" series, we will deconstruct Convolutional Neural Networks from first principles. Today, we look at the mathematical foundation: the convolution operation, pooling, and the architectural principles that make CNNs powerful. In Part 2, we will build complete CNN architectures in pure PyTorch. In Part 3, we will train on MNIST and visualize the learned filters and feature maps.
The Intuition Behind Convolution
Imagine sliding a small window (called a kernel or filter) across an image. At each position, you compute a weighted sum of pixel values. This operation detects specific patterns — vertical edges, horizontal edges, or color contrasts — depending on the kernel's weights.
The magic of CNNs is that these kernel weights are learned during training, not hand-designed.
The Convolution Operation
Mathematically, 2D convolution is defined as:

$$(I * K)(i, j) = \sum_{m} \sum_{n} I(i - m,\, j - n)\, K(m, n)$$

where $I$ is the input image and $K$ is the kernel. Note the minus signs: in true convolution, the kernel is flipped relative to the input.
In practice, deep learning libraries implement cross-correlation (the same operation without flipping the kernel), which is simpler and learns equivalent filters, since the weights are learned anyway:

$$(I \star K)(i, j) = \sum_{m} \sum_{n} I(i + m,\, j + n)\, K(m, n)$$
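To make the sliding-window picture concrete, here is a minimal NumPy sketch of valid-mode cross-correlation. The function name and the toy edge-detection example are illustrative, not from a library; Part 2 will use PyTorch's built-in layers instead of explicit loops.

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """Valid-mode 2D cross-correlation: slide the kernel over the image
    without flipping it, computing a weighted sum at each position."""
    H, W = image.shape
    k = kernel.shape[0]  # assume a square kernel
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

# A hand-designed vertical-edge kernel applied to a toy image with a
# dark-to-bright boundary down the middle
image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=float)
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=float)
print(cross_correlate2d(image, kernel))  # strong response at the vertical edge
```

Every output position straddling the dark-to-bright boundary responds strongly, which is exactly the behavior a learned edge filter converges to.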
For an input of size $H \times W$, a kernel of size $k$, padding $p$, and stride $s$, the output spatial dimensions are:

$$H_{\text{out}} = \left\lfloor \frac{H + 2p - k}{s} \right\rfloor + 1, \qquad W_{\text{out}} = \left\lfloor \frac{W + 2p - k}{s} \right\rfloor + 1$$

With no padding and stride 1, this reduces to $(H - k + 1) \times (W - k + 1)$.
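As a quick sanity check, the output-size arithmetic fits in a one-line helper (the function name and the example sizes are illustrative):

```python
def conv_output_size(n, k, p=0, s=1):
    """Spatial output size for input size n, kernel k, padding p, stride s."""
    return (n + 2 * p - k) // s + 1

print(conv_output_size(32, 3))            # 3x3 kernel, no padding: shrinks to 30
print(conv_output_size(32, 3, p=1))       # "same" padding: stays 32
print(conv_output_size(32, 3, p=1, s=2))  # stride 2: roughly halves to 16
```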
Key Architectural Principles
Local Connectivity
Unlike fully connected layers, convolutional neurons only connect to a local region of the input. This captures the intuition that nearby pixels are more related than distant ones. A single neuron in a convolutional layer sees only a small $k \times k$ patch, not the entire image.
Weight Sharing
The same kernel slides across the entire image. This dramatically reduces parameters and provides translation equivariance — a feature detected in one location can be detected anywhere. A 3x3 convolution with 64 input and 64 output channels has only:

$$3 \times 3 \times 64 \times 64 = 36{,}864 \text{ weights (plus 64 biases)}$$
Compare this to a fully connected layer between feature maps of the same size: flattening a $64 \times 32 \times 32$ input and connecting it to an output of the same shape would require $(64 \times 32 \times 32)^2 \approx 4.3$ billion weights.
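To put numbers on this comparison, here is a rough parameter count; the $64 \times 32 \times 32$ feature-map size is an illustrative assumption, not a fixed architecture choice:

```python
def conv_params(k, c_in, c_out, bias=True):
    """Parameters in a k x k convolution from c_in to c_out channels."""
    return k * k * c_in * c_out + (c_out if bias else 0)

def fc_params(n_in, n_out, bias=True):
    """Parameters in a fully connected layer from n_in to n_out units."""
    return n_in * n_out + (n_out if bias else 0)

print(conv_params(3, 64, 64))  # 36,928 = 36,864 weights + 64 biases
n = 64 * 32 * 32               # hypothetical 64-channel 32x32 map, flattened
print(fc_params(n, n))         # billions of weights for the same mapping
```

The convolutional count is independent of the spatial size of the input, which is precisely what weight sharing buys you.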
Pooling
Pooling layers reduce spatial dimensions, providing:
- Computational efficiency
- Translation invariance
- Expanded receptive fields for deeper layers
Max pooling, the most common variant, selects the maximum value in each window:

$$y_{i,j} = \max_{(m, n) \in \mathcal{R}_{i,j}} x_{m,n}$$

where $\mathcal{R}_{i,j}$ is the pooling window associated with output position $(i, j)$.
Average pooling instead computes the mean, smoothing the representation rather than preserving the strongest activations.
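Both variants can be sketched in a few lines of NumPy. This is a non-overlapping sketch that assumes the window size divides the input evenly, not a general implementation:

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Non-overlapping pooling: reshape the H x W map into a grid of
    size x size windows, then reduce each window with max or mean."""
    H, W = x.shape
    windows = x.reshape(H // size, size, W // size, size)
    if mode == "max":
        return windows.max(axis=(1, 3))
    return windows.mean(axis=(1, 3))

x = np.array([[1, 3, 2, 0],
              [4, 2, 1, 1],
              [0, 1, 5, 6],
              [2, 2, 7, 8]], dtype=float)
print(pool2d(x, mode="max"))  # keeps the strongest activation per window
print(pool2d(x, mode="avg"))  # smooths each window to its mean
```

Note how max pooling preserves the large activations in the bottom-right region while average pooling dilutes them, which is why max pooling tends to be preferred for detection-style features.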
Hierarchical Feature Learning
Deep CNNs learn a hierarchy of features through stacked convolutional layers:
- Early layers: Detect edges, corners, and simple textures
- Middle layers: Combine edges into motifs and patterns
- Deep layers: Recognize object parts and semantic concepts
This hierarchical representation is what makes CNNs so powerful for visual recognition. Each layer builds on the abstractions of the previous one, composing simple features into increasingly complex detectors.
Activation Functions
After each convolution, a non-linear activation function is applied — without one, any stack of convolutions would collapse into a single linear operation. The ReLU (Rectified Linear Unit) is the default choice in modern CNNs:

$$\text{ReLU}(x) = \max(0, x)$$
ReLU introduces non-linearity while largely avoiding the vanishing gradient problem that plagued earlier activations like sigmoid and tanh. It creates sparse activations — most neurons are inactive for any given input — which is computationally efficient and may aid generalization.
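A tiny example shows both the definition and the sparsity it induces (the sample activations are made up for illustration):

```python
import numpy as np

def relu(x):
    """Element-wise max(0, x): negative inputs are zeroed, positives pass through."""
    return np.maximum(0, x)

acts = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])  # hypothetical pre-activations
print(relu(acts))                   # negatives clipped to zero
print(np.mean(relu(acts) == 0.0))   # fraction of inactive units (sparsity)
```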
Next Steps: From Math to Code
The convolution operation, combined with weight sharing and hierarchical processing, provides CNNs with an inductive bias perfectly suited for visual data. These three principles — locality, translation equivariance, and compositionality — encode powerful assumptions about the structure of images.
In Part 2, we will implement these concepts in pure PyTorch. We will build a custom Conv2D layer, pooling operations, and complete CNN architectures including LeNet-5 and a VGG-style deep network.
Stay tuned for the code drop as we build CNNs from scratch!