Before Transformers took over sequence modeling, Recurrent Neural Networks were the go-to architecture for temporal data. Where feedforward networks treat each input independently, RNNs carry an internal state forward through time, letting them pick up on dependencies between time steps.
This post covers the core math behind RNNs: the recurrence relation, backpropagation through time (BPTT), and the vanishing gradient problem that ultimately limits them.
The Intuition Behind Recurrence
Think about reading a sentence one word at a time. Each word shifts your understanding based on everything you have read so far. RNNs do something analogous: they maintain a hidden state vector that gets updated at each time step, accumulating information as the sequence progresses.
A feedforward network maps a fixed-size input to a fixed-size output. It has no notion of order or history. An RNN, by contrast, processes inputs sequentially and threads a state vector through the entire sequence. That state vector is what gives the network memory -- it is a compressed summary of everything the network has seen up to the current step.
The Recurrence Relation
At each time step $t$, an RNN:
- Receives input $x_t$
- Combines it with the previous hidden state $h_{t-1}$
- Produces a new hidden state $h_t$
The update equation is:

$$h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$$
where:
- $W_{xh}$: Input-to-hidden weights
- $W_{hh}$: Hidden-to-hidden (recurrent) weights
- $b_h$: Hidden bias
- $\tanh$: Non-linearity that squashes values to $[-1, 1]$
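In code, one step of this recurrence is a single matrix-vector update. Here is a minimal NumPy sketch; the sizes `D` and `H` and the toy sequence are illustrative choices, not the post's actual model:

```python
import numpy as np

# Illustrative sizes: input dim D=3, hidden dim H=4.
D, H = 3, 4
rng = np.random.default_rng(0)

W_xh = rng.normal(0, 0.1, (H, D))  # input-to-hidden weights
W_hh = rng.normal(0, 0.1, (H, H))  # hidden-to-hidden (recurrent) weights
b_h = np.zeros(H)                  # hidden bias

def rnn_step(x_t, h_prev):
    """One application of the recurrence h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

h = np.zeros(H)                      # initial hidden state h_0
for x_t in rng.normal(size=(5, D)):  # a length-5 toy sequence
    h = rnn_step(x_t, h)

print(h.shape)  # (4,) -- the hidden state keeps a fixed size at every step
```

Note that the hidden state never grows with the sequence: each step folds the new input into the same fixed-size vector.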
Parameter Sharing Across Time
The same weights $(W_{xh}, W_{hh}, b_h)$ are reused at every time step. This parameter sharing has two practical consequences:
- The model can generalize to sequences of different lengths
- The number of parameters is independent of sequence length
This is a fundamentally different strategy from, say, a 1D convolution with a fixed receptive field, or a fully connected layer that has separate parameters for each position. Our implementation in PyTorch uses just 13,314 trainable parameters to process sequences of length 20 -- and the same weights would work on sequences of length 100 or 1,000 without any architectural change.
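The parameter count is easy to verify directly. A small NumPy sketch with illustrative sizes (`D=8`, `H=16`, not the post's model): the count depends only on the input and hidden dimensions, never on sequence length, and the same weights run over sequences of any length:

```python
import numpy as np

# Illustrative sizes: input dim D, hidden dim H.
D, H = 8, 16

# The RNN's full parameter set, reused at every time step.
n_params = H * D + H * H + H  # W_xh + W_hh + b_h
print(n_params)               # 400, regardless of sequence length

W_xh = np.zeros((H, D))
W_hh = np.zeros((H, H))
b_h = np.zeros(H)

def run(seq):
    h = np.zeros(H)
    for x_t in seq:  # the same weights are applied at every step
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
    return h

# The same parameter set handles sequences of any length:
h20 = run(np.zeros((20, D)))
h100 = run(np.zeros((100, D)))
print(h20.shape == h100.shape)  # True
```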
Backpropagation Through Time (BPTT)
Training RNNs requires computing gradients through the entire sequence. The chain rule applied across time steps gives:

$$\frac{\partial \mathcal{L}}{\partial h_t} = \frac{\partial \mathcal{L}}{\partial h_T} \prod_{k=t+1}^{T} \frac{\partial h_k}{\partial h_{k-1}}$$
Because $h_t$ depends on $h_{t-1}$, which depends on $h_{t-2}$, and so on, gradients must flow backward through the entire chain of dependencies. This is what makes RNN training fundamentally different from training feedforward networks: the computational graph is unrolled across time, and every time step introduces another node that gradients must pass through.
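To make this concrete: with a $\tanh$ recurrence, the Jacobian of one step with respect to the previous hidden state is $\operatorname{diag}(1 - h_t^2)\,W_{hh}$, and BPTT multiplies these per-step Jacobians backward through time. A NumPy sketch (with $W_{xh}$ taken as the identity for brevity; all sizes are illustrative) that checks the accumulated product against finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)
H, T = 4, 6
W_hh = rng.normal(0, 0.5, (H, H))
xs = rng.normal(size=(T, H))  # inputs; W_xh is the identity here for simplicity

def forward(h0):
    """Run the recurrence and return the hidden state at every step."""
    h, hs = h0, []
    for x in xs:
        h = np.tanh(x + W_hh @ h)
        hs.append(h)
    return hs

h0 = rng.normal(size=H)
hs = forward(h0)

# BPTT: d h_T / d h_0 is the product of per-step Jacobians
# J_t = diag(1 - h_t^2) @ W_hh, multiplied backward through time.
J = np.eye(H)
for h_t in reversed(hs):
    J = J @ (np.diag(1 - h_t**2) @ W_hh)

# Sanity check against central finite differences.
eps = 1e-6
J_fd = np.zeros((H, H))
for i in range(H):
    d = np.zeros(H)
    d[i] = eps
    J_fd[:, i] = (forward(h0 + d)[-1] - forward(h0 - d)[-1]) / (2 * eps)

print(np.allclose(J, J_fd, atol=1e-5))  # True
```

The finite-difference check confirms that the long product of per-step Jacobians really is the gradient that BPTT propagates.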
The Vanishing Gradient Problem
When we unroll the recurrence, we see the issue:

$$\frac{\partial h_T}{\partial h_t} = \prod_{k=t+1}^{T} \frac{\partial h_k}{\partial h_{k-1}} = \prod_{k=t+1}^{T} \operatorname{diag}\!\left(1 - h_k^2\right) W_{hh}$$
When the eigenvalues of $W_{hh}$ sit below 1 in magnitude, each factor contracts and the product shrinks exponentially with sequence length; the $\tanh$ derivative, which is at most 1, only compounds the shrinkage. The result: gradients from early time steps effectively disappear, and the network cannot learn long-range dependencies.
The flip side exists too. If the eigenvalues exceed 1, gradients explode -- they grow exponentially and destabilize training. Gradient clipping can mitigate exploding gradients, but there is no equally clean fix for the vanishing case. This asymmetry is what makes vanilla RNNs unreliable on long sequences and is the core motivation behind gated architectures like LSTMs and GRUs.
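The shrinkage is easy to observe numerically. A sketch (all sizes illustrative) that rescales a random $W_{hh}$ so its largest singular value is 0.5 and tracks the norm of the accumulated Jacobian product over 50 steps:

```python
import numpy as np

rng = np.random.default_rng(2)
H, T = 8, 50

# Rescale W_hh so its largest singular value is 0.5 -- the "vanishing" regime.
# (Scaling it above 1 instead would put us in the exploding regime.)
W = rng.normal(size=(H, H))
W *= 0.5 / np.linalg.svd(W, compute_uv=False)[0]

h = np.zeros(H)
J = np.eye(H)
norms = []
for _ in range(T):
    h = np.tanh(W @ h + rng.normal(size=H))
    J = np.diag(1 - h**2) @ W @ J  # accumulate the Jacobian product
    norms.append(np.linalg.norm(J))

print(norms[0], norms[-1])  # the product's norm collapses toward zero
```

After 50 steps the gradient signal from the first time step is numerically indistinguishable from zero, which is exactly why vanilla RNNs struggle with long-range dependencies.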
Why RNNs Still Matter
Despite being overshadowed by Transformers, RNNs remain relevant:
- Efficiency: Process sequences incrementally (online inference)
- Simplicity: Fewer parameters than attention-based models
- Foundation: Understanding RNNs is a prerequisite for LSTMs and GRUs
What Comes Next
The recurrence relation gives RNNs a straightforward way to model sequences, but vanishing gradients cap how far back they can actually learn from. In Part 2, we translate these equations into PyTorch code and build out several architecture variants.