Before Transformers dominated sequence modeling, Recurrent Neural Networks were the architecture of choice for temporal data. Unlike feedforward networks that process inputs independently, RNNs maintain an internal state that evolves over time, enabling them to capture dependencies across time steps.
In Part 1 of this mini-series, we deconstruct the mathematical foundation of RNNs: the recurrence relation, backpropagation through time, and the infamous vanishing gradient problem.
The Intuition Behind Recurrence
Consider reading a sentence word by word. As you read, you build up context—each word modifies your understanding based on what came before. RNNs formalize this intuition mathematically through a hidden state that accumulates information across time steps.
The Recurrence Relation
At each time step $t$, an RNN:
- Receives input $x_t$
- Combines it with the previous hidden state $h_{t-1}$
- Produces a new hidden state $h_t$
The update equation is:

$$h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$$

where:
- $W_{xh}$: Input-to-hidden weights
- $W_{hh}$: Hidden-to-hidden (recurrent) weights
- $b_h$: Bias term
- $\tanh$: Non-linearity that squashes values to $[-1, 1]$
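The update above can be sketched as a single function. This is a minimal NumPy illustration, not a library API; the sizes (3-dimensional input, 4-dimensional hidden state) are arbitrary choices for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (not from the text): 3-dim inputs, 4-dim hidden state
input_size, hidden_size = 3, 4

W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input-to-hidden weights
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden-to-hidden weights
b_h = np.zeros(hidden_size)                                    # bias

def rnn_step(x_t, h_prev):
    """One step of the recurrence: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

h_t = rnn_step(rng.normal(size=input_size), np.zeros(hidden_size))
print(h_t.shape)  # (4,) — tanh keeps every component in [-1, 1]
```

Note that the hidden state has a fixed size regardless of the input sequence: all accumulated context must fit in those few numbers.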
Parameter Sharing Across Time
Critically, the same weights $(W_{xh}, W_{hh}, b_h)$ are used at every time step. This is parameter sharing, and it provides two key benefits:
- The model can generalize to sequences of different lengths
- The number of parameters is independent of sequence length
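Both benefits are visible directly in code: the loop below (a sketch with made-up sizes, continuing the NumPy style above) applies one fixed set of parameters at every step, so the same function handles a length-5 and a length-50 sequence:

```python
import numpy as np

rng = np.random.default_rng(1)
input_size, hidden_size = 3, 4  # illustrative sizes

# One fixed set of parameters, regardless of how long the sequences are
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

def rnn_forward(xs):
    """Apply the SAME (W_xh, W_hh, b_h) at every time step of the sequence."""
    h = np.zeros(hidden_size)
    for x_t in xs:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
    return h  # final hidden state

# Identical parameters, two very different sequence lengths
h_short = rnn_forward(rng.normal(size=(5, input_size)))
h_long = rnn_forward(rng.normal(size=(50, input_size)))
print(h_short.shape, h_long.shape)  # (4,) (4,)
```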
Backpropagation Through Time (BPTT)
Training RNNs requires computing gradients through the entire sequence. For the loss $\mathcal{L}_t$ at step $t$, the chain rule applied across time steps gives:

$$\frac{\partial \mathcal{L}_t}{\partial W_{hh}} = \sum_{k=1}^{t} \frac{\partial \mathcal{L}_t}{\partial h_t} \left( \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}} \right) \frac{\partial h_k}{\partial W_{hh}}$$
Because $h_t$ depends on $h_{t-1}$, which depends on $h_{t-2}$, and so on, gradients must flow backward through the entire chain of dependencies.
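BPTT can be written out by hand for a tiny network. The sketch below assumes a toy loss $\mathcal{L} = \frac{1}{2}\lVert h_T \rVert^2$ (chosen purely for illustration) and checks the backward pass against a finite-difference estimate:

```python
import numpy as np

rng = np.random.default_rng(2)
input_size, hidden_size, T = 2, 3, 4  # illustrative sizes and sequence length

W_xh = rng.normal(scale=0.5, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)
xs = rng.normal(size=(T, input_size))

def forward(W_hh):
    """Forward pass storing every hidden state; toy loss L = 0.5 * ||h_T||^2."""
    hs = [np.zeros(hidden_size)]
    for x_t in xs:
        hs.append(np.tanh(W_xh @ x_t + W_hh @ hs[-1] + b_h))
    return hs, 0.5 * np.sum(hs[-1] ** 2)

hs, loss = forward(W_hh)

# BPTT: walk backward through time, accumulating each step's contribution to dL/dW_hh
delta = hs[-1].copy()                      # dL/dh_T for the toy loss
grad_W_hh = np.zeros_like(W_hh)
for t in range(T, 0, -1):
    da = delta * (1.0 - hs[t] ** 2)        # backprop through tanh
    grad_W_hh += np.outer(da, hs[t - 1])   # local gradient at step t
    delta = W_hh.T @ da                    # flow back to h_{t-1}

# Check one entry against a central finite difference
eps = 1e-6
Wp, Wm = W_hh.copy(), W_hh.copy()
Wp[0, 0] += eps
Wm[0, 0] -= eps
numeric = (forward(Wp)[1] - forward(Wm)[1]) / (2 * eps)
print(abs(numeric - grad_W_hh[0, 0]) < 1e-6)  # True
```

The backward loop is exactly the chain of dependencies described above: `delta` carries $\partial \mathcal{L} / \partial h_t$ backward one step at a time, and $W_{hh}$ receives a gradient contribution from every time step because it is shared across all of them.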
The Vanishing Gradient Problem
When we unroll the recurrence, we see the issue:

$$\frac{\partial h_t}{\partial h_k} = \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}} = \prod_{i=k+1}^{t} \mathrm{diag}\!\left(1 - h_i^2\right) W_{hh}$$

If the eigenvalues of $W_{hh}$ all have magnitude less than 1, repeated multiplication causes the gradient to shrink exponentially in the gap $t - k$ (the $\tanh$ derivative, at most 1, only makes things worse). This makes it extremely difficult for standard RNNs to learn long-range dependencies.
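The exponential decay is easy to observe numerically. The sketch below considers the pessimistic linear case (treating the $\tanh$ derivative as 1, so the per-step Jacobian is just $W_{hh}$) and rescales $W_{hh}$ so its largest singular value is 0.9, which forces every eigenvalue magnitude below 1:

```python
import numpy as np

rng = np.random.default_rng(3)
hidden_size, T = 16, 50  # illustrative sizes

# Rescale W_hh so its largest singular value is 0.9
W_hh = rng.normal(size=(hidden_size, hidden_size))
W_hh *= 0.9 / np.linalg.norm(W_hh, 2)

# In the linear case, the unrolled Jacobian dh_T/dh_0 is W_hh applied T times;
# its spectral norm is then bounded by 0.9^T
J = np.eye(hidden_size)
norms = []
for _ in range(T):
    J = W_hh @ J
    norms.append(np.linalg.norm(J, 2))

print(norms[0], norms[9], norms[49])  # shrinks roughly like 0.9^t
```

After 50 steps the gradient signal from the earliest inputs has shrunk by more than two orders of magnitude, so those inputs barely influence the parameter updates.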
Why RNNs Still Matter
Despite being overshadowed by Transformers, RNNs remain relevant:
- Efficiency: Process sequences incrementally (online inference)
- Simplicity: Fewer parameters than attention-based models
- Foundation: Understanding RNNs is a prerequisite for LSTMs and GRUs
Conclusion
The RNN's elegant recurrence relation provides a natural framework for sequence modeling. However, the vanishing gradient problem limits its practical utility. In Part 2, we implement RNNs in PyTorch and explore architectural improvements.