
RNNs from Scratch

Part 1: The Math of Recurrence

Introduction

Before Transformers dominated sequence modeling, Recurrent Neural Networks were the architecture of choice for temporal data. Unlike feedforward networks that process inputs independently, RNNs maintain an internal state that evolves over time, enabling them to capture dependencies across time steps.

In Part 1 of this mini-series, we deconstruct the mathematical foundation of RNNs: the recurrence relation, backpropagation through time, and the infamous vanishing gradient problem.

The Intuition Behind Recurrence

Consider reading a sentence word by word. As you read, you build up context—each word modifies your understanding based on what came before. RNNs formalize this intuition mathematically through a hidden state that accumulates information across time steps.

The Recurrence Relation

At each time step $t$, an RNN combines the current input $x_t$ with the previous hidden state $h_{t-1}$ to produce an updated hidden state $h_t$.

The update equation is:

$$ h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h) $$

where:

  - $x_t$ is the input vector at time step $t$
  - $h_t$ is the hidden state at time step $t$, and $h_{t-1}$ is the previous hidden state
  - $W_{xh}$ and $W_{hh}$ are the input-to-hidden and hidden-to-hidden weight matrices
  - $b_h$ is a bias vector

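The update equation is a one-liner in code. Here is a minimal NumPy sketch of a single step; the dimensions and initialization scales are illustrative, not from the article:

```python
import numpy as np

# Illustrative sizes (not from the article)
input_size, hidden_size = 3, 4
rng = np.random.default_rng(0)

# Parameters mirror the equation's W_xh, W_hh, b_h
W_xh = rng.normal(0, 0.1, (hidden_size, input_size))
W_hh = rng.normal(0, 0.1, (hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """One application of h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

x_t = rng.normal(size=input_size)
h0 = np.zeros(hidden_size)     # a common choice for the initial state
h1 = rnn_step(x_t, h0)
print(h1.shape)  # (4,)
```

Note that the $\tanh$ nonlinearity keeps every component of the hidden state in $(-1, 1)$, which matters for the gradient analysis later in this post.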
Parameter Sharing Across Time

Critically, the same weights $(W_{xh}, W_{hh}, b_h)$ are used at every time step. This is parameter sharing, and it provides two key benefits:

  1. The model can generalize to sequences of different lengths
  2. The number of parameters is independent of sequence length
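Both benefits fall out of the forward pass directly: the same three parameters are applied in a loop, so sequence length only changes how many times the loop runs. A minimal sketch (sizes are again illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
input_size, hidden_size = 3, 4

# One fixed set of parameters, shared by every time step
W_xh = rng.normal(0, 0.1, (hidden_size, input_size))
W_hh = rng.normal(0, 0.1, (hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

def forward(xs):
    """Run the recurrence over a whole sequence, reusing the same weights."""
    h = np.zeros(hidden_size)
    states = []
    for x_t in xs:                        # loop length = sequence length
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        states.append(h)
    return states

# The same parameters handle sequences of any length
short = forward(rng.normal(size=(2, input_size)))
long = forward(rng.normal(size=(10, input_size)))
print(len(short), len(long))  # 2 10
```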

Backpropagation Through Time (BPTT)

Training RNNs requires computing gradients through the entire sequence. The chain rule applied across time steps gives:

$$ \frac{\partial L}{\partial W_{hh}} = \sum_t \frac{\partial L}{\partial h_t} \cdot \frac{\partial h_t}{\partial W_{hh}} $$

Because $h_t$ depends on $h_{t-1}$, which depends on $h_{t-2}$, and so on, gradients must flow backward through the entire chain of dependencies.
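The sum over time steps can be computed with a single backward pass over the stored states. Below is a hedged NumPy sketch that uses a made-up loss $L = \frac{1}{2}\lVert h_T \rVert^2$ (chosen only so the check is short); it accumulates $\partial L / \partial W_{hh}$ term by term as in the formula above, then verifies one entry against a central finite difference:

```python
import numpy as np

rng = np.random.default_rng(2)
n_in, n_h, T = 3, 4, 5                      # illustrative sizes
W_xh = rng.normal(0, 0.5, (n_h, n_in))
W_hh = rng.normal(0, 0.5, (n_h, n_h))
b_h = np.zeros(n_h)
xs = rng.normal(size=(T, n_in))

def forward(W_hh):
    h = np.zeros(n_h)
    hs = [h]                                # keep all states for the backward pass
    for x_t in xs:
        h = np.tanh(W_xh @ x_t + W_hh @ hs[-1] + b_h)
        hs.append(h)
    return hs

# Hypothetical loss for illustration: L = 0.5 * ||h_T||^2
hs = forward(W_hh)
dL_dh = hs[-1].copy()                       # dL/dh_T
dW_hh = np.zeros_like(W_hh)
for t in range(T, 0, -1):                   # walk backward through time
    da = dL_dh * (1 - hs[t] ** 2)           # through the tanh nonlinearity
    dW_hh += np.outer(da, hs[t - 1])        # sum_t (dL/dh_t)(dh_t/dW_hh)
    dL_dh = W_hh.T @ da                     # propagate the gradient to h_{t-1}

# Finite-difference check on a single entry of W_hh
eps = 1e-6
Wp = W_hh.copy(); Wp[0, 0] += eps
Wm = W_hh.copy(); Wm[0, 0] -= eps
num = (0.5 * np.sum(forward(Wp)[-1] ** 2)
       - 0.5 * np.sum(forward(Wm)[-1] ** 2)) / (2 * eps)
print(abs(num - dW_hh[0, 0]) < 1e-6)  # True
```

The backward loop is the essence of BPTT: each iteration handles one term of the sum, and the line `dL_dh = W_hh.T @ da` is exactly the chain of dependencies flowing one step further back.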

The Vanishing Gradient Problem

When we unroll the recurrence, we see the issue:

$$ \frac{\partial h_t}{\partial h_0} = \prod_{k=1}^{t} \frac{\partial h_k}{\partial h_{k-1}} = \prod_{k=1}^{t} \text{diag}\!\left(1 - \tanh^2(a_k)\right) W_{hh} $$

where $a_k = W_{xh} x_k + W_{hh} h_{k-1} + b_h$ is the pre-activation at step $k$.

Each factor has spectral norm at most $\lVert W_{hh} \rVert$, since the diagonal entries $1 - \tanh^2(a_k)$ lie in $(0, 1]$. If the largest singular value of $W_{hh}$ is below 1, repeated multiplication shrinks gradients exponentially in $t$ (and if it is well above 1, they can explode instead). This makes it extremely difficult for standard RNNs to learn long-range dependencies.
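The exponential decay is easy to observe numerically. This sketch (assumed sizes; $W_{hh}$ is rescaled so its largest singular value is 0.5) multiplies the per-step Jacobians and tracks the norm of the running product:

```python
import numpy as np

rng = np.random.default_rng(3)
n_h = 4

# Rescale W_hh so its largest singular value is 0.5 (< 1, the vanishing regime)
W = rng.normal(size=(n_h, n_h))
W_hh = 0.5 * W / np.linalg.svd(W, compute_uv=False)[0]

h = np.zeros(n_h)
grad = np.eye(n_h)           # accumulates prod_k diag(1 - tanh^2(a_k)) W_hh
norms = []
for t in range(50):
    a = W_hh @ h + rng.normal(size=n_h)   # random inputs stand in for W_xh x_t
    h = np.tanh(a)
    J = np.diag(1 - h ** 2) @ W_hh        # Jacobian dh_t / dh_{t-1}
    grad = J @ grad
    norms.append(np.linalg.norm(grad))

# The product's norm collapses toward zero within a few dozen steps
print(norms[0], norms[-1])
```

After 50 steps the norm is on the order of $0.5^{50} \approx 10^{-15}$: any learning signal from the first time step has effectively disappeared.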

Why RNNs Still Matter

Despite being overshadowed by Transformers, RNNs remain relevant: inference costs constant memory and compute per step because the hidden state summarizes the entire history, they are a natural fit for streaming data that arrives one step at a time, and their recurrence is the foundation for gated descendants such as LSTMs and GRUs as well as more recent recurrent designs like state-space models.

Conclusion

The RNN's elegant recurrence relation provides a natural framework for sequence modeling. However, the vanishing gradient problem limits its practical utility. In Part 2, we implement RNNs in PyTorch and explore architectural improvements.