
LSTMs from Scratch

Part 3: Training and Analyzing Gates

Introduction

In Part 1, we established the mathematical foundation of gated recurrence. In Part 2, we built the complete LSTM architecture in pure PyTorch—from individual cells to multi-layer stacks, classifiers, taggers, and encoder-decoder models.

Now, we reap the rewards: training our LSTM on a long-range dependency task and dissecting what the gates actually learn. The results reveal the beautiful internal strategies that emerge when a neural network must learn to remember.

The Training Process

We trained a 2-layer LSTM (128 hidden units, 197,378 parameters) on a synthetic long-range dependency task: classify a sequence of length 30 based solely on the first and last elements. This task is deliberately designed to be impossible for standard RNNs—the model must preserve information across 30 time steps of irrelevant noise.
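The exact dataset construction is not shown in this post, so the following is a hedged reconstruction: the label depends only on the first and last tokens, with noise in between. The label rule, alphabet size, and one-hot encoding below are assumptions, not the actual task code.

```python
import torch

def make_longrange_batch(n_seqs, seq_len=30, n_classes=4, noise_dim=8, seed=0):
    # Hypothetical reconstruction of the task: the label is a function of the
    # first and last tokens only; every step in between is pure noise.
    g = torch.Generator().manual_seed(seed)
    first = torch.randint(0, n_classes, (n_seqs,), generator=g)
    last = torch.randint(0, n_classes, (n_seqs,), generator=g)
    labels = (first + last) % n_classes               # assumed label rule
    x = torch.randn(n_seqs, seq_len, noise_dim, generator=g)
    rows = torch.arange(n_seqs)
    x[:, 0, :] = 0.0
    x[:, -1, :] = 0.0
    x[rows, 0, first] = 1.0                           # one-hot first token
    x[rows, -1, last] = 1.0                           # one-hot last token
    return x, labels
```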

We used Adam optimizer with learning rate $10^{-3}$, a step learning rate scheduler (halving every 20 epochs), gradient clipping at $\|\nabla\| = 1.0$, and trained for 100 epochs on 5,000 training sequences.
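The training configuration above can be sketched as follows. The model and data dimensions are placeholders; only the optimizer, scheduler, and clipping settings come from the text.

```python
import torch
import torch.nn as nn

model = nn.LSTM(input_size=8, hidden_size=128, num_layers=2, batch_first=True)
head = nn.Linear(128, 4)  # classification head (placeholder dimensions)
params = list(model.parameters()) + list(head.parameters())

optimizer = torch.optim.Adam(params, lr=1e-3)
# Halve the learning rate every 20 epochs, as described above
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)

def train_step(x, y):
    optimizer.zero_grad()
    out, _ = model(x)                     # out: (batch, seq_len, hidden)
    loss = nn.functional.cross_entropy(head(out[:, -1]), y)
    loss.backward()
    # Clip the global gradient norm at 1.0 to stabilize training
    torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)
    optimizer.step()
    return loss.item()
```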

Training Loss and Accuracy Curves
Figure 1: Training and test loss (left) and accuracy (right) over 100 epochs. The LSTM converges rapidly, reaching ~100% training accuracy by epoch 30 and peaking at ~95% test accuracy. The widening gap between train and test loss after epoch 30 shows classic overfitting: the model begins to memorize the training set, though it still generalizes well to unseen sequences. Compare this to a standard RNN, which would plateau around 60-70% on the same task.

Training Results

Our 2-layer LSTM achieves:

- 94% test accuracy within just the first 10 epochs
- ~100% training accuracy by epoch 30
- A peak test accuracy of ~95%

This convergence speed is remarkable given that a standard RNN would struggle to exceed 60-70% on the same task, even with extensive training, due to vanishing gradients.

Visualizing Gate Dynamics

To understand how the LSTM solves this task, we extract and visualize gate activations across all 30 time steps for 8 individual hidden units in the first layer. The patterns that emerge are remarkably interpretable.
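One simple way to expose the gate activations is to compute each LSTM step manually from the layer's own weights and log the intermediate sigmoids. The sketch below uses PyTorch's i, f, g, o ordering of the stacked gate weights; the dictionary of returned gates is our own convention.

```python
import torch

def lstm_cell_with_gates(x_t, h_prev, c_prev, w_ih, w_hh, b_ih, b_hh):
    # One LSTM step that also returns the gate activations for visualization.
    # PyTorch stacks the four gate weight blocks in i, f, g, o order.
    gates = x_t @ w_ih.T + h_prev @ w_hh.T + b_ih + b_hh
    i, f, g, o = gates.chunk(4, dim=-1)
    i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
    g = torch.tanh(g)                       # cell candidate
    c_t = f * c_prev + i * g                # cell state update
    h_t = o * torch.tanh(c_t)
    return h_t, c_t, {"input": i, "forget": f, "output": o, "candidate": g}
```

Calling this in a loop over time steps, with the weights of the trained layer, yields exactly the per-step gate traces plotted in Figure 2.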

Gate Activation Dynamics
Figure 2: Gate activations across 30 time steps for 8 hidden units in the first LSTM layer. Top-left: Input gate—several units (e.g., Unit 1) learn to saturate near 1.0 over time, indicating persistent information storage. Top-right: Forget gate—most units maintain activations between 0.8-1.0, confirming the gradient highway stays open. Bottom-left: Output gate—diverse patterns with some units learning to selectively reveal information. Bottom-right: Cell candidate (tanh) values show the rich diversity of candidate information generated at each step.

Forget Gate Patterns

The forget gate plot reveals the most critical insight: nearly all 8 visualized units maintain activations between 0.8 and 1.0 throughout the 30-step sequence. This is exactly the "Constant Error Carousel" that Hochreiter and Schmidhuber described—the gradient highway stays open, allowing error signals to flow backward unattenuated.

Input Gate Behavior

The input gate shows more varied patterns across units:

- Several units (e.g., Unit 1) saturate near 1.0 over time, consistent with persistent information storage
- Other units stay selective, opening only intermittently rather than admitting every step's candidate

Output Gate Modulation

The output gate displays the richest diversity:

- Some units learn to selectively reveal their stored information
- Others show activation patterns that vary markedly across time steps, with no single shared strategy

The Cell State as Memory

Constant Error Carousel

The forget gate visualizations directly confirm Hochreiter and Schmidhuber's "Constant Error Carousel" hypothesis. With forget gates at 0.8-1.0, the cell state equation

$$ c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t $$

preserves the previous cell state almost entirely while selectively adding new information. Over 30 time steps with $f_t \approx 0.9$, the retained signal is $0.9^{30} \approx 0.04$—still enough to maintain a gradient pathway, unlike an RNN where multiplicative decay drives this value to near zero.
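A quick numeric check of that claim (the RNN decay factor of 0.3 is purely illustrative, chosen to represent a Jacobian norm well below 1):

```python
# Retention of a signal after 30 multiplicative steps. A forget gate saturated
# near 0.9 keeps a usable fraction alive; a typical RNN Jacobian factor
# (0.3 here, purely illustrative) annihilates it.
lstm_retention = 0.9 ** 30
rnn_retention = 0.3 ** 30
print(f"LSTM: {lstm_retention:.4f}   RNN: {rnn_retention:.2e}")
```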

Overfitting Analysis

The training curves reveal a classic deep learning phenomenon: the model reaches 100% training accuracy by epoch 30 but test accuracy plateaus at ~95%. The widening loss gap confirms overfitting. In a production setting, we would apply standard remedies:

- Dropout between LSTM layers
- Weight decay (L2 regularization) in the optimizer
- Early stopping based on validation loss
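A minimal sketch of how dropout and weight decay would be wired in (the 0.3 dropout rate and 1e-4 weight-decay coefficient are illustrative, not tuned values):

```python
import torch
import torch.nn as nn

# Dropout applies between stacked LSTM layers; weight decay is handled
# directly by the optimizer. Both values here are illustrative.
model = nn.LSTM(input_size=8, hidden_size=128, num_layers=2,
                dropout=0.3, batch_first=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
```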

Gate Coordination Patterns

Copy Mechanism

For the long-range dependency task, the gates learn a clean copy strategy:

- Forget gates stay near 1.0 throughout, preserving the cell state across all 30 steps
- Input gates admit information at the informative first and last positions while filtering the noise in between
- Output gates modulate what gets revealed for the final classification

This is exactly the strategy a human designer would implement—but the network discovers it entirely through gradient descent.
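We can sanity-check this copy schedule with idealized gate values: the forget gate pinned at 1.0 and the input gate open only at the endpoints. Under those assumptions, the final cell state is exactly the sum of the two informative candidates.

```python
import torch

T = 30
torch.manual_seed(0)
candidates = torch.randn(T, 4)              # tanh candidates at each step
c = torch.zeros(4)                          # cell state
for t in range(T):
    i_t = 1.0 if t in (0, T - 1) else 0.0   # input gate: store endpoints only
    f_t = 1.0                               # forget gate: never erase
    c = f_t * c + i_t * candidates[t]       # c_t = f * c_{t-1} + i * c~_t
```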

LSTMs vs GRUs vs Transformers

GRU Simplification

Gated Recurrent Units (Cho et al., 2014) simplify LSTMs:

- The forget and input gates are merged into a single update gate that interpolates between the old state and the new candidate
- The separate cell state and hidden state are merged into one vector
- A reset gate controls how much of the previous state feeds into the candidate
- The result is three gate weight blocks instead of four, roughly 25% fewer recurrent parameters
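A sketch of a single GRU step, using PyTorch's r, z, n ordering of the stacked gate weights:

```python
import torch

def gru_cell(x_t, h_prev, w_ih, w_hh, b_ih, b_hh):
    # Minimal GRU step (Cho et al., 2014). The update gate z plays the role of
    # a tied forget/input pair, and there is no separate cell state.
    gi = x_t @ w_ih.T + b_ih
    gh = h_prev @ w_hh.T + b_hh
    i_r, i_z, i_n = gi.chunk(3, dim=-1)
    h_r, h_z, h_n = gh.chunk(3, dim=-1)
    r = torch.sigmoid(i_r + h_r)           # reset gate
    z = torch.sigmoid(i_z + h_z)           # update gate
    n = torch.tanh(i_n + r * h_n)          # candidate state
    return (1 - z) * n + z * h_prev        # interpolate old and new state
```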

Transformer Comparison

Transformers replace recurrence with self-attention: any past position is reachable in a single step, but attention cost grows as $O(n^2)$ in sequence length, and inference requires keeping the full history (or a growing key-value cache) in memory. An LSTM instead carries a fixed-size state, so each new token costs the same regardless of how much history precedes it. LSTMs therefore remain the architecture of choice for streaming and real-time applications where constant per-step inference cost is critical.

Conclusion

We built LSTMs entirely from scratch in PyTorch—no external libraries, no pre-trained weights. Our 2-layer LSTM achieved ~100% training accuracy and ~95% test accuracy on a 30-step long-range dependency task, and the gate visualizations confirm the theoretical predictions: forget gates stay open to create gradient highways, input gates selectively store relevant information, and output gates modulate what gets revealed for classification.

Thank you for following this 3-part "Build in Public" series on LSTMs. The full training logs, gate visualizations, and code are live on the GitHub repo. Stay connected on LinkedIn for future architectural tear-downs!