Standard RNNs have a fundamental flaw: in practice they cannot learn long-range dependencies. The vanishing gradient problem causes gradients to decay exponentially as they backpropagate through time, making it effectively impossible to connect events separated by many time steps.
Long Short-Term Memory (LSTM) networks, introduced by Hochreiter and Schmidhuber in 1997, solved this problem through an elegant architectural innovation: gated memory cells.
In Part 1 of this "Build in Public" mini-series, we deconstruct the mathematical foundation of LSTMs: the cell state, the three gates, and how they enable gradient flow across hundreds of time steps. In Part 2, we will build the complete architecture in PyTorch. In Part 3, we will train on a long-range dependency task and visualize gate dynamics.
The Core Insight: Additive Memory
The key innovation of LSTMs is the cell state $c_t$, which runs through the entire sequence like a conveyor belt. Information is added to or removed from this state through gates—learned, differentiable switches that control information flow.
Critically, the cell state update is additive:

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$

where $\odot$ denotes elementwise (Hadamard) multiplication.
This additive structure creates a gradient highway: gradients can flow backward through time without vanishing, because the derivative of the additive path is simply the forget gate $f_t$, not a repeated multiplication by a weight matrix that squashes the gradient at every step. This is the core reason LSTMs work where vanilla RNNs fail.
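We can check the "gradient highway" claim numerically with a finite-difference sketch in NumPy. The gate values and candidates below are made-up numbers, not learned quantities; the point is only that the derivative of the additive update with respect to the previous cell state is exactly the forget gate.

```python
import numpy as np

# Hypothetical gate activations and candidate values (in a real LSTM
# these come from learned weights applied to [h_{t-1}, x_t]).
f = np.array([0.9, 0.5])         # forget gate
i = np.array([0.3, 0.7])         # input gate
c_tilde = np.array([0.2, -0.4])  # candidate cell values

def cell_update(c_prev):
    """Additive cell-state update: c_t = f * c_prev + i * c_tilde."""
    return f * c_prev + i * c_tilde

c_prev = np.array([1.0, -2.0])
eps = 1e-6
# Finite-difference derivative of c_t w.r.t. c_prev (elementwise)
grad = (cell_update(c_prev + eps) - cell_update(c_prev)) / eps
print(grad)  # close to f = [0.9, 0.5]: the gradient is just the forget gate
```

No matter what the cell state holds, the backward signal through this path is scaled by $f_t$ alone, which the network can learn to keep near 1.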
The Three Gates
Forget Gate
The forget gate decides what information to discard from the cell state:

$$f_t = \sigma(W_f \, [h_{t-1}, x_t] + b_f)$$

The sigmoid $\sigma$ squashes each component to a value between 0 (completely forget) and 1 (completely keep).
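A minimal NumPy sketch of the forget-gate computation, using random (untrained) parameters purely for illustration, with hypothetical sizes `hidden_size=4`, `input_size=3`:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
hidden_size, input_size = 4, 3

# Hypothetical (untrained) parameters, for illustration only.
W_f = rng.standard_normal((hidden_size, hidden_size + input_size))
b_f = np.zeros(hidden_size)

h_prev = rng.standard_normal(hidden_size)  # previous hidden state h_{t-1}
x_t = rng.standard_normal(input_size)      # current input x_t

# Forget gate: sigmoid of an affine map of the concatenation [h_{t-1}, x_t]
f_t = sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)
print(f_t)  # every entry lies strictly between 0 and 1
```

The same affine-then-squash pattern, with separate weights, produces the input and output gates as well.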
Input Gate
The input gate controls what new information to store:

$$i_t = \sigma(W_i \, [h_{t-1}, x_t] + b_i), \qquad \tilde{c}_t = \tanh(W_c \, [h_{t-1}, x_t] + b_c)$$

where $i_t$ decides which values to update, and $\tilde{c}_t$ creates candidate values to add.
Output Gate
The output gate determines what to output based on the cell state:

$$o_t = \sigma(W_o \, [h_{t-1}, x_t] + b_o), \qquad h_t = o_t \odot \tanh(c_t)$$
The Cell State Update
Combining all gates, one LSTM step computes:

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \qquad h_t = o_t \odot \tanh(c_t)$$
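Putting the three gates and the additive update together, here is a sketch of one full LSTM time step in NumPy. The dimensions, initialization scale, and sequence length are arbitrary choices for the demo, not prescriptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step: three gates plus the additive cell update.

    params holds (W, b) pairs for the forget, input, candidate, and
    output transforms; each acts on the concatenation [h_{t-1}, x_t].
    """
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])      # forget gate
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])      # input gate
    c_tilde = np.tanh(params["W_c"] @ z + params["b_c"])  # candidate values
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])      # output gate
    c_t = f_t * c_prev + i_t * c_tilde                    # additive update
    h_t = o_t * np.tanh(c_t)                              # new hidden state
    return h_t, c_t

rng = np.random.default_rng(42)
hidden_size, input_size = 5, 3
params = {}
for gate in ("f", "i", "c", "o"):
    params[f"W_{gate}"] = 0.1 * rng.standard_normal((hidden_size, hidden_size + input_size))
    params[f"b_{gate}"] = np.zeros(hidden_size)

# Run a short sequence of random inputs through the cell.
h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for t in range(10):
    h, c = lstm_step(rng.standard_normal(input_size), h, c, params)
print(h.shape, c.shape)
```

Note that $h_t = o_t \odot \tanh(c_t)$ is always bounded in $(-1, 1)$, while the cell state $c_t$ is unbounded, which is exactly what lets it act as long-term memory.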
Why LSTMs Don't Vanish
During backpropagation, the gradient of the cell state along the direct cell-to-cell path is:

$$\frac{\partial c_t}{\partial c_{t-1}} = \operatorname{diag}(f_t)$$

(ignoring the gates' own dependence on $h_{t-1}$). When the forget gate is close to 1, gradients flow through essentially unchanged. The LSTM can learn to keep $f_t \approx 1$ when long-range information is needed, creating a nearly unattenuated gradient pathway. Hochreiter and Schmidhuber called this the "Constant Error Carousel" — a mechanism that allows error signals to flow backward for hundreds or even thousands of time steps.
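A back-of-the-envelope comparison makes the difference concrete. The per-step contraction factor of 0.9 for the vanilla RNN is an assumed, illustrative value (it stands in for the tanh-derivative-times-recurrent-weight Jacobian norm), as is the forget-gate value of 0.999:

```python
import numpy as np

T = 500  # number of time steps the gradient must traverse

# LSTM: along the cell-state path each backward step multiplies by f_t.
# Assume the network has learned to hold the forget gate near 1.
f = 0.999
lstm_grad = f ** T

# Vanilla RNN: each backward step multiplies by a Jacobian whose norm
# is typically below 1; 0.9 per step is an assumed illustrative value.
rnn_factor = 0.9
rnn_grad = rnn_factor ** T

print(f"LSTM path after {T} steps: {lstm_grad:.3f}")
print(f"RNN  path after {T} steps: {rnn_grad:.3e}")
```

Under these assumptions the LSTM signal survives at a usable magnitude after 500 steps, while the RNN signal has collapsed to numerical dust.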
Comparison to RNNs
- RNN: Single hidden state, multiplicative updates → vanishing gradients
- LSTM: Separate cell state, additive updates → stable gradients
Next Steps: From Math to Code
We have established the mathematical foundation of gated recurrence: three gates controlling a dedicated additive memory cell.
In Part 2, we implement this exact formulation in pure PyTorch. We will build from a single LSTMCell up through multi-layer LSTMs, bidirectional processing, sequence classification, and the encoder-decoder Seq2Seq architecture.
Stay tuned for the code drop as we build LSTMs from scratch!