
LSTMs from Scratch

Part 3: Training and Analyzing Gates

Introduction

Part 1 covered the math. Part 2 built the architecture in PyTorch. Now we train the model on a long-range dependency task and look at what the gates actually learn.

Training Setup

We trained a 2-layer LSTM (128 hidden units, 204,290 parameters) on a synthetic task: classify a length-30 sequence based solely on its first and last elements. The 28 intermediate values are random noise. A standard RNN cannot solve this -- by the time it reaches the end of the sequence, the first element has been washed out by vanishing gradients.

Training details: Adam with lr $10^{-3}$, a step scheduler that halves the learning rate every 20 epochs, gradient clipping at $\|\nabla\| = 1.0$, 100 epochs, 5,000 training sequences.
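For reference, a minimal sketch of this setup. It uses torch.nn.LSTM in place of Part 2's from-scratch cells for brevity; the input width, class count, and label rule are illustrative assumptions, while the optimizer, scheduler, clipping, and epoch/dataset sizes come straight from the text above.

```python
import torch
import torch.nn as nn

def make_batch(n, seq_len=30, n_classes=4):
    """Sequences whose label depends only on the first and last elements;
    the 28 values in between are noise. n_classes and the label rule
    below are illustrative assumptions, not the repo's exact task."""
    first = torch.randint(n_classes, (n,))
    last = torch.randint(n_classes, (n,))
    x = torch.randn(n, seq_len, 1)        # noise filler
    x[:, 0, 0] = first.float()
    x[:, -1, 0] = last.float()
    y = (first + last) % n_classes        # illustrative label rule
    return x, y

model = nn.LSTM(input_size=1, hidden_size=128, num_layers=2, batch_first=True)
head = nn.Linear(128, 4)
params = list(model.parameters()) + list(head.parameters())

optimizer = torch.optim.Adam(params, lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)
criterion = nn.CrossEntropyLoss()

for epoch in range(100):
    x, y = make_batch(5000)
    out, _ = model(x)
    loss = criterion(head(out[:, -1]), y)            # classify from last step
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)  # clip at ||grad|| = 1
    optimizer.step()
    scheduler.step()
```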

Training Loss and Accuracy Curves
Figure 1: Loss and accuracy over 100 epochs. The model hits 94.4% test accuracy by epoch 10 and peaks at 95.7% (epoch 17). After epoch 30, the train-test loss gap widens as the model overfits the 5,000 training examples.

Training Results

The training log tells a quick story: the model cracks the task fast, reaching 90.5% test accuracy by epoch 5. The gap between the peak (95.7% at epoch 17) and the final test accuracy (94.1% at epoch 100) reflects overfitting on a small dataset, not a failure of the architecture.

Visualizing Gate Dynamics

To see how the LSTM solves this task, we extracted gate activations across all 30 time steps for 8 hidden units in the first layer.
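Here is one way to pull those activations out, recomputing the gates step by step from the trained weights of the nn.LSTM in the sketch above (a from-scratch cell like Part 2's could simply return its gates directly; PyTorch stores the four gate blocks in i, f, g, o order):

```python
# Recompute layer-1 gates from the trained weights.
W_ih = model.weight_ih_l0          # (4*hidden, input), rows ordered i, f, g, o
W_hh = model.weight_hh_l0          # (4*hidden, hidden)
b = model.bias_ih_l0 + model.bias_hh_l0

x, _ = make_batch(1)               # one length-30 sequence
h = torch.zeros(128)
c = torch.zeros(128)
gates = {"input": [], "forget": [], "output": []}

with torch.no_grad():
    for t in range(x.shape[1]):
        z = W_ih @ x[0, t] + W_hh @ h + b
        i, f, g, o = z.chunk(4)    # split the four gate pre-activations
        i, f, o = i.sigmoid(), f.sigmoid(), o.sigmoid()
        c = f * c + i * g.tanh()   # the cell update from Part 1
        h = o * c.tanh()
        for name, gate in (("input", i), ("forget", f), ("output", o)):
            gates[name].append(gate[:8])   # track the first 8 hidden units

# gates["forget"] is now 30 steps x 8 units: one curve per unit, as in Figure 2
```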

Gate Activation Dynamics
Figure 2: Per-unit gate activations over the 30-step sequence. Forget gates cluster near 0.8-1.0, keeping the gradient highway open. Input and output gates show unit-level specialization.

Forget Gate Patterns

Nearly all 8 units hold forget gate activations between 0.8 and 1.0 across the full sequence. This is the Constant Error Carousel in action -- the network learns to keep the gradient highway open so that error signals from the classification head can reach the first time step.

Input Gate Behavior

Input gate activations are more varied than the forget gate's. As Figure 2 shows, units specialize: rather than staying uniformly high, the input gate opens selectively so that only relevant elements get written into the cell state.

Output Gate Modulation

The output gate is the most diverse of the three. Per Figure 2, units differ in when they expose the cell state to the hidden state, controlling what the classification head gets to read out.

The Cell State as Memory

Constant Error Carousel

The forget gate data directly confirms Hochreiter and Schmidhuber's (1997) hypothesis. With $f_t$ in the 0.8-1.0 range, the update

$$ c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t $$

preserves most of the previous cell state while selectively adding new information. Even with $f_t \approx 0.9$, after 30 steps the retained fraction is $0.9^{30} \approx 0.04$ -- small, but enough to maintain a gradient pathway. In a vanilla RNN, the equivalent multiplicative decay drives this to effectively zero.
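A two-line check of that arithmetic (the 0.5 per-step factor for the RNN is purely illustrative):

```python
# Fraction of the initial cell state surviving 30 steps with f_t = 0.9
print(0.9 ** 30)  # ~0.042: small, but a live gradient pathway

# Vanilla RNN: each step multiplies by W_hh and tanh' < 1. With an
# assumed effective factor of 0.5 per step, the signal is gone:
print(0.5 ** 30)  # ~9.3e-10: effectively zero
```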

Overfitting Analysis

The model hits 100% training accuracy by epoch 50, but test accuracy plateaus around 95% and slowly degrades. The widening loss gap (train: 0.0003, test: 0.3073 at epoch 100) is textbook overfitting on a small dataset. For production, the standard remedies apply: dropout between the LSTM layers, weight decay, early stopping near the peak epoch, or simply more training data, as in the sketch below.
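A minimal version of those fixes against the earlier training sketch (the 0.2 dropout, 1e-5 weight decay, and patience of 10 are illustrative, not tuned):

```python
# Dropout between stacked layers + L2 regularization + early stopping
model = nn.LSTM(input_size=1, hidden_size=128, num_layers=2,
                dropout=0.2, batch_first=True)
head = nn.Linear(128, 4)
params = list(model.parameters()) + list(head.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3, weight_decay=1e-5)
criterion = nn.CrossEntropyLoss()

x_val, y_val = make_batch(1000)                 # held-out split
best_acc, patience, bad_epochs = 0.0, 10, 0

for epoch in range(100):
    model.train()
    x, y = make_batch(5000)
    loss = criterion(head(model(x)[0][:, -1]), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        acc = (head(model(x_val)[0][:, -1]).argmax(1) == y_val).float().mean().item()
    if acc > best_acc:
        best_acc, bad_epochs = acc, 0
        torch.save(model.state_dict(), "best.pt")   # checkpoint the peak
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                   # stop near the peak epoch
            break
```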

Gate Coordination: The Copy Strategy

The gates learn a clean three-phase pattern for this task:

1. Write: at the first step, the input gates open to store the first element in the cell state.
2. Hold: across the 28 noise steps, the forget gates sit near 1 while the input gates stay low, so the stored value survives untouched.
3. Read: at the final step, the last element is written in and the output gates expose the cell state to the classification head.

This is the strategy you would design by hand -- but the network finds it through gradient descent alone.
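In fact, that hand design takes only a few lines. A sketch for a single cell unit, where 7.0 and 3.0 are arbitrary stand-ins for the first and last elements:

```python
import random

# Hand-coding the three-phase strategy for one cell unit. The network
# discovers this schedule by gradient descent; here we just verify that
# the schedule itself solves the task.
seq = [7.0] + [random.gauss(0, 1) for _ in range(28)] + [3.0]

c = 0.0
for t, x_t in enumerate(seq):
    i_t = 1.0 if t in (0, len(seq) - 1) else 0.0  # write: first & last step only
    f_t = 1.0                                      # hold: never forget
    c = f_t * c + i_t * x_t                        # the cell update from above
print(c)  # 10.0 == first + last; the 28 noise values never touched the cell
```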

LSTMs vs GRUs vs Transformers

GRU Simplification

GRUs (Cho et al., 2014) collapse the LSTM's two states into one and merge the forget/input gates into a single update gate. Fewer parameters, often similar performance on shorter sequences.
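Concretely, the standard GRU update, with $z_t$ the update gate, $r_t$ the reset gate, and biases omitted (some references, including PyTorch's docs, swap which term $z_t$ multiplies):

$$ z_t = \sigma(W_z x_t + U_z h_{t-1}), \qquad r_t = \sigma(W_r x_t + U_r h_{t-1}) $$

$$ h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tanh\!\big(W_h x_t + U_h (r_t \odot h_{t-1})\big) $$

One gate does double duty: the fraction of the old state kept and the fraction of the candidate written always sum to 1, whereas the LSTM's $f_t$ and $i_t$ are free to vary independently.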

Transformer Comparison

Transformers sidestep recurrence entirely: self-attention gives every position a direct path to every other, at $O(n^2)$ cost in sequence length and with the whole sequence held in memory. An LSTM instead carries a fixed-size state forward at constant cost per step, so LSTMs still make sense for streaming and real-time applications where per-step latency and bounded memory matter.

Conclusion

A from-scratch 2-layer LSTM with 204,290 parameters reaches 95.7% peak test accuracy on a 30-step dependency task, hitting 90%+ within 5 epochs. The gate visualizations line up with theory: forget gates hold open to create gradient highways, input gates selectively write relevant information, and output gates control what gets read out for classification. Full code, training logs, and visualizations are on the GitHub repo.