RNNs from Scratch

Part 3: Training and Analyzing Dynamics

Introduction

With the RNN implemented, we train it on a synthetic sequence classification task and dig into the results. This post covers the training numbers, hidden state visualizations, and a practical look at where vanilla RNNs break down.

Training Results

The model has 13,314 parameters and was trained for 50 epochs on CPU, ending at 93.7% test accuracy.

Most of the learning happens in the first 10 epochs, with train accuracy jumping from 50.31% to 87.81% and test loss dropping from 0.6670 to 0.2036. After that, gains are incremental -- the model plateaus around 93% test accuracy by epoch 20 and fluctuates there for the remaining 30 epochs. The learning rate scheduler halves the rate at epoch 20, which briefly tightens the loss curves, but the network is already near its capacity on this task.

Training Loss and Accuracy Curves
Loss and accuracy over 50 epochs. The sharp drop in loss during epochs 1-10 gives way to noisy plateaus.

Visualizing Hidden State Dynamics

Hidden State Trajectories

Plotting individual hidden units across time steps reveals different roles. Some units spike in response to specific input patterns and reset quickly. Others ramp up gradually, integrating information over multiple steps. The tanh squashing keeps all activations in $[-1, 1]$, which you can see clearly in the plots below.
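If you want to reproduce this kind of plot, here is a minimal sketch using only NumPy and matplotlib. It runs a randomly initialized vanilla RNN cell as a stand-in for the trained model (the weight names `W_xh`, `W_hh`, `b_h` are placeholders rather than the exact variables from Part 2); with the trained weights loaded instead, you would see the spiking and ramping behaviors described above.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
T, input_size, hidden_size = 20, 8, 64

# Randomly initialized vanilla RNN cell as a stand-in for the trained model;
# in practice you would load the weights learned in Part 2 instead.
W_xh = rng.normal(0, 0.3, (hidden_size, input_size))
W_hh = rng.normal(0, 0.3, (hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

x = rng.normal(size=(T, input_size))             # one sample input sequence
h = np.zeros(hidden_size)
history = np.zeros((T, hidden_size))
for t in range(T):
    h = np.tanh(W_xh @ x[t] + W_hh @ h + b_h)    # same update rule as the from-scratch cell
    history[t] = h

# Plot a few units; tanh keeps every trace inside [-1, 1].
for unit in (0, 7, 21, 42):
    plt.plot(history[:, unit], label=f"unit {unit}")
plt.xlabel("time step")
plt.ylabel("activation")
plt.ylim(-1.05, 1.05)
plt.legend()
plt.show()
```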

Hidden State Visualizations
Hidden unit activations over time steps for a sample input sequence.

Temporal Integration

Three broad behaviors show up across the 64 hidden units:

- Accumulator units that ramp up gradually, integrating evidence over many time steps.
- Transient units that spike in response to specific input patterns and reset within a step or two.
- Oscillatory units whose activations alternate in sign from step to step.

The classification task requires computing a global mean, so the accumulator units are doing most of the heavy lifting. They approximate a running average, and the final-step readout extracts the sign of that average for the binary label. The transient and oscillatory units likely encode finer-grained patterns that help with boundary cases.
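To make the accumulator behavior concrete, consider a single hidden unit with a recurrent weight close to 1: in tanh's near-linear region it behaves as a leaky integrator, computing an exponentially weighted average of its inputs, which moves with the running mean the readout needs. The toy snippet below uses illustrative numbers rather than weights from the trained network.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 20
x = rng.normal(0, 0.3, T)        # small inputs keep tanh near its linear region

w_hh, w_xh = 0.95, 0.1           # recurrent weight near 1 -> slow leak
h = 0.0
trace = []
for t in range(T):
    h = np.tanh(w_hh * h + w_xh * x[t])   # single-unit version of the RNN update
    trace.append(h)

# The unit tracks an exponentially weighted average of past inputs,
# a proxy (up to scale) for the running mean the classifier needs.
running_mean = np.cumsum(x) / np.arange(1, T + 1)
for t in range(0, T, 5):
    print(f"t={t:2d}  unit={trace[t]:+.3f}  running mean={running_mean[t]:+.3f}")
```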

The Vanishing Gradient Problem in Practice

Gradient Flow Analysis

The total gradient with respect to the weights sums contributions from every time step:

$$ \frac{\partial L}{\partial W} = \sum_t \frac{\partial L}{\partial h_t} \cdot \frac{\partial h_t}{\partial W} $$

In practice, the contributions from early time steps are orders of magnitude smaller than those from later steps. Each backward step multiplies by $W_{hh}^T \cdot \text{diag}(1 - \tanh^2(\cdot))$, and with tanh derivatives bounded between 0 and 1, the product shrinks rapidly unless $W_{hh}$ is large enough to compensate (in which case gradients explode instead).
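You can measure the shrinkage directly. The sketch below uses random weights as a stand-in for the trained parameters, with the recurrent matrix rescaled to spectral norm 0.9 so we stay in the vanishing (rather than exploding) regime, and prints the norm of the accumulated Jacobian product as it is carried back through a 20-step sequence.

```python
import numpy as np

rng = np.random.default_rng(2)
T, input_size, hidden_size = 20, 8, 64

# Random weights as a stand-in for the trained parameters; the recurrent
# matrix is rescaled to spectral norm 0.9, the regime where gradients vanish.
W_xh = rng.normal(0, 0.3, (hidden_size, input_size))
W_hh = rng.normal(size=(hidden_size, hidden_size))
W_hh *= 0.9 / np.linalg.norm(W_hh, 2)

# Forward pass, keeping the activations needed for the Jacobians.
x = rng.normal(size=(T, input_size))
h = np.zeros(hidden_size)
hs = []
for t in range(T):
    h = np.tanh(W_xh @ x[t] + W_hh @ h)
    hs.append(h)

# d h_T / d h_t is the product of per-step Jacobians diag(1 - h_k^2) @ W_hh
# for k = T down to t+1; its norm says how much gradient reaches step t.
prod = np.eye(hidden_size)
for k in range(T - 1, 0, -1):
    prod = prod @ (np.diag(1.0 - hs[k] ** 2) @ W_hh)
    if k in (15, 10, 5, 1):
        print(f"gradient reaching time step {k:2d}: norm ~ {np.linalg.norm(prod, 2):.2e}")
```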

What This Looks Like in Practice

The consequence is straightforward: weight updates are dominated by the last few time steps, and information presented early in the sequence has correspondingly little influence on what the network learns.

For our 20-step sequences this is manageable -- the network still reaches 93.7% test accuracy. But scale up to sequences of length 100 or 500, and the model would struggle to learn dependencies that span the full input. The training loss would stall at a higher value, and early-sequence information would be effectively invisible to the optimizer.
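A back-of-the-envelope calculation shows why longer sequences are so much worse. If each backward step scales the gradient by some factor $\gamma < 1$, the signal from the first time step is attenuated by roughly $\gamma^{T-1}$; the value of $\gamma$ below is illustrative, not measured from the trained model.

```python
# Back-of-the-envelope: if each backward step scales the gradient by gamma < 1
# (an illustrative value), the signal from the first time step is attenuated
# by roughly gamma ** (T - 1).
gamma = 0.8
for T in (20, 100, 500):
    print(f"T={T:3d}: first-step gradient scale ~ {gamma ** (T - 1):.1e}")
```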

Why LSTMs and GRUs Were Invented

The vanishing gradient problem motivated gated architectures:

- LSTMs add a separate cell state updated additively, with input, forget, and output gates controlling what gets written, kept, and exposed. The additive update gives gradients a path through time that is scaled by the forget gate instead of being repeatedly squashed through tanh and multiplied by $W_{hh}$ (see the sketch below).
- GRUs simplify this to two gates (update and reset) acting directly on the hidden state, with similar gradient behavior and fewer parameters.
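Here is a minimal NumPy sketch of a single LSTM step, written in the same style as the from-scratch RNN cell (the shapes and gate ordering follow one common convention and are not code from the earlier parts). The important line is the cell-state update `c = f * c_prev + i * g`: it is additive and gated, so the gradient through the cell state is scaled by the forget gate rather than repeatedly pushed through tanh derivatives.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4*hidden, input+hidden), b has shape (4*hidden,)."""
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    i = sigmoid(z[0*hidden:1*hidden])   # input gate: how much new content to write
    f = sigmoid(z[1*hidden:2*hidden])   # forget gate: how much old cell state to keep
    o = sigmoid(z[2*hidden:3*hidden])   # output gate: how much of the cell to expose
    g = np.tanh(z[3*hidden:4*hidden])   # candidate content
    c = f * c_prev + i * g              # additive update: gradient scaled by f, not tanh'
    h = o * np.tanh(c)
    return h, c

# Tiny smoke test with made-up sizes.
rng = np.random.default_rng(3)
input_size, hidden_size = 8, 16
x = rng.normal(size=input_size)
h, c = np.zeros(hidden_size), np.zeros(hidden_size)
W = rng.normal(0, 0.3, (4 * hidden_size, input_size + hidden_size))
b = np.zeros(4 * hidden_size)
h, c = lstm_step(x, h, c, W, b)
print(h.shape, c.shape)   # (16,) (16,)
```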

RNNs vs Transformers

Computational Complexity

An RNN processes a length-$T$ sequence in $T$ sequential steps, each costing $O(d^2)$ for hidden size $d$, so the total work is $O(T d^2)$ but cannot be parallelized across time. Self-attention costs $O(T^2 d)$ per layer, yet every position is computed in parallel, which is what makes Transformers so much faster to train on modern hardware.

Memory Characteristics

At inference time an RNN carries only its fixed-size hidden state, so memory is $O(d)$ regardless of sequence length. A Transformer must attend over all previous tokens, so its key-value cache grows as $O(T d)$, which is why recurrent models remain attractive for streaming and on-device settings.

Use Cases

Transformers dominate wherever training throughput and long-range context matter most, such as large-scale language modeling. Vanilla RNNs and their gated descendants still make sense for streaming inference, low-latency or memory-constrained deployments, and as a pedagogical baseline, which is exactly how we are using them here.

Wrapping Up

Our 13K-parameter RNN hits 93.7% test accuracy on this synthetic task, which is decent but clearly limited. The vanishing gradient problem is not just theoretical -- it shows up directly in the gradient magnitudes and in the network's inability to use early sequence information. LSTMs and GRUs were designed specifically to fix this, and Transformers sidestep it entirely with attention. Still, RNNs are worth understanding: they are simple, memory-efficient for streaming inference, and the foundation for everything that came after.