Before Transformers took over sequence modeling, Recurrent Neural Networks were the go-to architecture for temporal data. Where feedforward networks treat each input independently, RNNs carry an internal state forward through time, letting them pick up on dependencies between time steps.
This post covers the core math behind RNNs: the recurrence relation, backpropagation through time (BPTT), and the vanishing gradient problem that ultimately limits them.
The Intuition Behind Recurrence
Think about reading a sentence one word at a time. Each word shifts your understanding based on everything you have read so far. RNNs do something analogous: they maintain a hidden state vector that gets updated at each time step, accumulating information as the sequence progresses.
A feedforward network maps a fixed-size input to a fixed-size output. It has no notion of order or history. An RNN, by contrast, processes inputs sequentially and threads a state vector through the entire sequence. That state vector is what gives the network memory -- it is a compressed summary of everything the network has seen up to the current step.
The Recurrence Relation
At each time step $t$, an RNN:
- Receives input $x_t$
- Combines it with the previous hidden state $h_{t-1}$
- Produces a new hidden state $h_t$
The update equation is:

$$h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$$
where:
- $W_{xh}$: Input-to-hidden weights
- $W_{hh}$: Hidden-to-hidden (recurrent) weights
- $b_h$: Hidden bias
- $\tanh$: Non-linearity that squashes values to $[-1, 1]$
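In code, one step of this recurrence is a single matrix-vector update. Here is a minimal NumPy sketch; the sizes `D` and `H` and the toy sequence are illustrative choices, not the post's actual model:

```python
import numpy as np

# Illustrative sizes: input dim D=3, hidden dim H=4.
D, H = 3, 4
rng = np.random.default_rng(0)

W_xh = rng.normal(0, 0.1, (H, D))  # input-to-hidden weights
W_hh = rng.normal(0, 0.1, (H, H))  # hidden-to-hidden (recurrent) weights
b_h = np.zeros(H)                  # hidden bias

def rnn_step(x_t, h_prev):
    """One application of the recurrence h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

h = np.zeros(H)                      # initial hidden state h_0
for x_t in rng.normal(size=(5, D)):  # a length-5 toy sequence
    h = rnn_step(x_t, h)

print(h.shape)  # (4,) -- the hidden state keeps a fixed size at every step
```

Note that the hidden state never grows with the sequence: each step folds the new input into the same fixed-size vector.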
Parameter Sharing Across Time
The same weights $(W_{xh}, W_{hh}, b_h)$ are reused at every time step. This parameter sharing has two practical consequences:
- The model can generalize to sequences of different lengths
- The number of parameters is independent of sequence length
This is a fundamentally different strategy from, say, a 1D convolution with a fixed receptive field, or a fully connected layer that has separate parameters for each position. Our implementation in PyTorch uses just 13,314 trainable parameters to process sequences of length 20 -- and the same weights would work on sequences of length 100 or 1,000 without any architectural change.
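The parameter count is easy to verify directly. A small NumPy sketch with illustrative sizes (`D=8`, `H=16`, not the post's model): the count depends only on the input and hidden dimensions, never on sequence length, and the same weights run over sequences of any length:

```python
import numpy as np

# Illustrative sizes: input dim D, hidden dim H.
D, H = 8, 16

# The RNN's full parameter set, reused at every time step.
n_params = H * D + H * H + H  # W_xh + W_hh + b_h
print(n_params)               # 400, regardless of sequence length

W_xh = np.zeros((H, D))
W_hh = np.zeros((H, H))
b_h = np.zeros(H)

def run(seq):
    h = np.zeros(H)
    for x_t in seq:  # the same weights are applied at every step
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
    return h

# The same parameter set handles sequences of any length:
h20 = run(np.zeros((20, D)))
h100 = run(np.zeros((100, D)))
print(h20.shape == h100.shape)  # True
```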
Backpropagation Through Time (BPTT)
Training RNNs requires computing gradients through the entire sequence. The chain rule applied across time steps gives:

$$\frac{\partial \mathcal{L}}{\partial h_t} = \frac{\partial \mathcal{L}}{\partial h_T} \prod_{k=t+1}^{T} \frac{\partial h_k}{\partial h_{k-1}}$$
Because $h_t$ depends on $h_{t-1}$, which depends on $h_{t-2}$, and so on, gradients must flow backward through the entire chain of dependencies. This is what makes RNN training fundamentally different from training feedforward networks: the computational graph is unrolled across time, and every time step introduces another node that gradients must pass through.
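To make this concrete: with a $\tanh$ recurrence, the Jacobian of one step with respect to the previous hidden state is $\operatorname{diag}(1 - h_t^2)\,W_{hh}$, and BPTT multiplies these per-step Jacobians backward through time. A NumPy sketch (with $W_{xh}$ taken as the identity for brevity; all sizes are illustrative) that checks the accumulated product against finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)
H, T = 4, 6
W_hh = rng.normal(0, 0.5, (H, H))
xs = rng.normal(size=(T, H))  # inputs; W_xh is the identity here for simplicity

def forward(h0):
    """Run the recurrence and return the hidden state at every step."""
    h, hs = h0, []
    for x in xs:
        h = np.tanh(x + W_hh @ h)
        hs.append(h)
    return hs

h0 = rng.normal(size=H)
hs = forward(h0)

# BPTT: d h_T / d h_0 is the product of per-step Jacobians
# J_t = diag(1 - h_t^2) @ W_hh, multiplied backward through time.
J = np.eye(H)
for h_t in reversed(hs):
    J = J @ (np.diag(1 - h_t**2) @ W_hh)

# Sanity check against central finite differences.
eps = 1e-6
J_fd = np.zeros((H, H))
for i in range(H):
    d = np.zeros(H)
    d[i] = eps
    J_fd[:, i] = (forward(h0 + d)[-1] - forward(h0 - d)[-1]) / (2 * eps)

print(np.allclose(J, J_fd, atol=1e-5))  # True
```

The finite-difference check confirms that the long product of per-step Jacobians really is the gradient that BPTT propagates.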
The Vanishing Gradient Problem
When we unroll the recurrence, we see the issue:

$$\frac{\partial h_T}{\partial h_t} = \prod_{k=t+1}^{T} \frac{\partial h_k}{\partial h_{k-1}} = \prod_{k=t+1}^{T} \operatorname{diag}\!\left(1 - h_k^2\right) W_{hh}$$
When the eigenvalues of $W_{hh}$ sit below 1 in magnitude, each factor contracts and the product shrinks exponentially with sequence length; the $\tanh$ derivative, which is at most 1, only compounds the shrinkage. The result: gradients from early time steps effectively disappear, and the network cannot learn long-range dependencies.
The flip side exists too. If the eigenvalues exceed 1, gradients explode -- they grow exponentially and destabilize training. Gradient clipping can mitigate exploding gradients, but there is no equally clean fix for the vanishing case. This asymmetry is what makes vanilla RNNs unreliable on long sequences and is the core motivation behind gated architectures like LSTMs and GRUs.
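The shrinkage is easy to observe numerically. A sketch (all sizes illustrative) that rescales a random $W_{hh}$ so its largest singular value is 0.5 and tracks the norm of the accumulated Jacobian product over 50 steps:

```python
import numpy as np

rng = np.random.default_rng(2)
H, T = 8, 50

# Rescale W_hh so its largest singular value is 0.5 -- the "vanishing" regime.
# (Scaling it above 1 instead would put us in the exploding regime.)
W = rng.normal(size=(H, H))
W *= 0.5 / np.linalg.svd(W, compute_uv=False)[0]

h = np.zeros(H)
J = np.eye(H)
norms = []
for _ in range(T):
    h = np.tanh(W @ h + rng.normal(size=H))
    J = np.diag(1 - h**2) @ W @ J  # accumulate the Jacobian product
    norms.append(np.linalg.norm(J))

print(norms[0], norms[-1])  # the product's norm collapses toward zero
```

After 50 steps the gradient signal from the first time step is numerically indistinguishable from zero, which is exactly why vanilla RNNs struggle with long-range dependencies.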
Why RNNs Still Matter
Despite being overshadowed by Transformers, RNNs remain relevant:
- Efficiency: Process sequences incrementally (online inference)
- Simplicity: Fewer parameters than attention-based models
- Foundation: Understanding RNNs is a prerequisite for LSTMs and GRUs
What Comes Next
The recurrence relation gives RNNs a straightforward way to model sequences, but vanishing gradients cap how far back they can actually learn from. In Part 2, we translate these equations into PyTorch code and build out several architecture variants.