LSTMs from Scratch

Part 1: The Math of Gated Recurrence

Introduction

Standard RNNs have a fundamental flaw: they struggle to learn long-range dependencies. The vanishing gradient problem causes gradients to decay exponentially as they backpropagate through time, making it effectively impossible to connect events separated by many time steps.

Long Short-Term Memory (LSTM) networks, introduced by Hochreiter and Schmidhuber in 1997, solved this problem through an elegant architectural innovation: gated memory cells.

In Part 1 of this "Build in Public" mini-series, we deconstruct the mathematical foundation of LSTMs: the cell state, the three gates, and how they enable gradient flow across hundreds of time steps. In Part 2, we will build the complete architecture in PyTorch. In Part 3, we will train on a long-range dependency task and visualize gate dynamics.

The Core Insight: Additive Memory

The key innovation of LSTMs is the cell state $c_t$, which runs through the entire sequence like a conveyor belt. Information is added to or removed from this state through gates—learned, differentiable switches that control information flow.

Critically, the cell state update is additive:

$$ c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t $$

This additive structure creates a gradient highway: during backpropagation, the local gradient along the cell-state path is the forget gate itself, applied elementwise, rather than the repeated multiplication by a recurrent weight matrix that drives vanilla RNN gradients toward zero. This is the core reason LSTMs work where vanilla RNNs fail.
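
To make the contrast concrete, here is a minimal PyTorch sketch comparing a multiplicative (vanilla-RNN-style) recurrence with an additive (cell-state-style) one. The scalar states, the weight 0.9, and the step count are toy assumptions chosen purely for illustration:

```python
import torch

T = 100  # number of time steps (toy value)

# Multiplicative recurrence (vanilla-RNN-style): h_t = tanh(w * h_{t-1})
h0 = torch.tensor(0.5, requires_grad=True)
h, w = h0, 0.9  # |w| < 1, an illustrative recurrent weight
for _ in range(T):
    h = torch.tanh(w * h)
h.backward()
print(f"multiplicative: dh_T/dh_0 = {h0.grad.item():.2e}")  # ~0 (vanished)

# Additive recurrence (LSTM-style cell state): c_t = f * c_{t-1} + input
c0 = torch.tensor(0.5, requires_grad=True)
c, f = c0, 1.0  # forget gate held open
for _ in range(T):
    c = f * c + 0.01  # additive update
c.backward()
print(f"additive:       dc_T/dc_0 = {c0.grad.item():.2e}")  # exactly 1
```

Run it and the multiplicative gradient collapses toward zero while the additive one stays at exactly 1: the conveyor-belt behavior the equation above promises.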

The Three Gates

Forget Gate

The forget gate decides what information to discard from the cell state:

$$ f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f) $$

Output: values between 0 (completely forget) and 1 (completely keep).
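
In code, the forget gate is one line: a linear map of the current input and previous hidden state, squashed by a sigmoid. The shapes and weight tensors below are illustrative stand-ins for learned parameters:

```python
import torch

input_size, hidden_size = 3, 4  # toy dimensions

# Illustrative stand-ins for learned parameters
W_xf = torch.randn(hidden_size, input_size)
W_hf = torch.randn(hidden_size, hidden_size)
b_f = torch.zeros(hidden_size)

x_t = torch.randn(input_size)      # current input x_t
h_prev = torch.randn(hidden_size)  # previous hidden state h_{t-1}

f_t = torch.sigmoid(W_xf @ x_t + W_hf @ h_prev + b_f)
print(f_t)  # each entry lies in (0, 1): 0 = forget, 1 = keep
```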

Input Gate

The input gate controls what new information to store:

$$ \begin{aligned} i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i) \\ \tilde{c}_t &= \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \end{aligned} $$

where $i_t$ decides which values to update, and $\tilde{c}_t$ creates candidate values.
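
The same pattern in a sketch, again with illustrative random parameters: a sigmoid gate deciding where to write, and a tanh candidate deciding what to write:

```python
import torch

input_size, hidden_size = 3, 4
x_t, h_prev = torch.randn(input_size), torch.randn(hidden_size)

# Illustrative stand-ins for learned parameters
W_xi, W_hi, b_i = torch.randn(hidden_size, input_size), torch.randn(hidden_size, hidden_size), torch.zeros(hidden_size)
W_xc, W_hc, b_c = torch.randn(hidden_size, input_size), torch.randn(hidden_size, hidden_size), torch.zeros(hidden_size)

i_t = torch.sigmoid(W_xi @ x_t + W_hi @ h_prev + b_i)   # where to write, in (0, 1)
c_tilde = torch.tanh(W_xc @ x_t + W_hc @ h_prev + b_c)  # what to write, in (-1, 1)
new_information = i_t * c_tilde  # the gated candidate added to the cell state
```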

Output Gate

The output gate determines what to output based on the cell state:

$$ \begin{aligned} o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o) \\ h_t &= o_t \odot \tanh(c_t) \end{aligned} $$
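
And the output gate as a sketch, with illustrative parameters, exposing a gated, squashed view of the cell state:

```python
import torch

input_size, hidden_size = 3, 4
x_t, h_prev = torch.randn(input_size), torch.randn(hidden_size)
c_t = torch.randn(hidden_size)  # the already-updated cell state

# Illustrative stand-ins for learned parameters
W_xo = torch.randn(hidden_size, input_size)
W_ho = torch.randn(hidden_size, hidden_size)
b_o = torch.zeros(hidden_size)

o_t = torch.sigmoid(W_xo @ x_t + W_ho @ h_prev + b_o)
h_t = o_t * torch.tanh(c_t)  # hidden state: a gated view of the cell state
```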

The Cell State Update

Combining all gates:

$$ c_t = \underbrace{f_t \odot c_{t-1}}_{\text{keep old}} + \underbrace{i_t \odot \tilde{c}_t}_{\text{add new}} $$
$$ h_t = o_t \odot \tanh(c_t) $$
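
Putting the four equations together, here is a sketch of one complete LSTM step as a plain function. Parameter names mirror the math; real implementations (including torch.nn.LSTMCell, which we turn to in Part 2) stack the four gates into fused weight matrices for speed, but the arithmetic is identical:

```python
import torch

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step, following the equations above.
    `p` is a dict of (illustrative) per-gate weights keyed by name."""
    f_t = torch.sigmoid(p["W_xf"] @ x_t + p["W_hf"] @ h_prev + p["b_f"])   # forget gate
    i_t = torch.sigmoid(p["W_xi"] @ x_t + p["W_hi"] @ h_prev + p["b_i"])   # input gate
    c_tilde = torch.tanh(p["W_xc"] @ x_t + p["W_hc"] @ h_prev + p["b_c"])  # candidate
    o_t = torch.sigmoid(p["W_xo"] @ x_t + p["W_ho"] @ h_prev + p["b_o"])   # output gate
    c_t = f_t * c_prev + i_t * c_tilde  # additive cell state update
    h_t = o_t * torch.tanh(c_t)         # gated hidden state
    return h_t, c_t

# Toy usage: unroll a short random sequence
I, H = 3, 4
p = {f"W_x{g}": torch.randn(H, I) for g in "fico"}
p.update({f"W_h{g}": torch.randn(H, H) for g in "fico"})
p.update({f"b_{g}": torch.zeros(H) for g in "fico"})
h, c = torch.zeros(H), torch.zeros(H)
for t in range(5):
    h, c = lstm_step(torch.randn(I), h, c, p)
```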

Why LSTMs Don't Vanish

During backpropagation, the gradient of the cell state is:

$$ \frac{\partial c_t}{\partial c_{t-1}} = f_t + \text{(other terms)} $$

where the other terms arise because the gates and candidate at step $t$ depend on $h_{t-1}$, which in turn depends on $c_{t-1}$.

When the forget gate stays close to 1, gradients flow through nearly unchanged. The LSTM can learn to hold $f_t \approx 1$ when long-range information is needed, creating a nearly unattenuated gradient pathway. Hochreiter and Schmidhuber called this the "Constant Error Carousel": a mechanism that lets error signals flow backward for hundreds or even thousands of time steps.
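
We can verify this with autograd. In the sketch below the gates are computed from random values that do not depend on $c_{t-1}$, so the "(other terms)" drop out and the recovered gradient is exactly $f_t$:

```python
import torch

H = 4
c_prev = torch.randn(H, requires_grad=True)

# Gates computed from values independent of c_prev,
# so the "(other terms)" in the Jacobian vanish
f_t = torch.sigmoid(torch.randn(H))
i_t = torch.sigmoid(torch.randn(H))
c_tilde = torch.tanh(torch.randn(H))

c_t = f_t * c_prev + i_t * c_tilde  # the additive update
c_t.sum().backward()

print(torch.allclose(c_prev.grad, f_t))  # True: the gradient is exactly f_t
```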

Comparison to RNNs

A vanilla RNN compresses all memory into a single hidden state, $h_t = \tanh(W_x x_t + W_h h_{t-1} + b)$, so every backward step multiplies the gradient by the recurrent weight matrix (scaled by the $\tanh$ derivative), and the product shrinks or explodes exponentially with sequence length. The LSTM separates storage from output: the cell state $c_t$ holds memory and is updated additively under gate control, while $h_t$ is a gated, squashed view of it. The cost is roughly four times the parameters and computation of a vanilla RNN per step; the payoff is stable gradient flow over long horizons.

Next Steps: From Math to Code

We have established the mathematical foundation of gated recurrence: three gates controlling a dedicated additive memory cell.

In Part 2, we implement this exact formulation in pure PyTorch. We will build from a single LSTMCell up through multi-layer LSTMs, bidirectional processing, sequence classification, and the encoder-decoder Seq2Seq architecture.

Stay tuned for the code drop as we build LSTMs from scratch!