Part 1 covered the math: forget gate, input gate, output gate, and the additive cell state update. Now we translate those equations into working PyTorch code, building from a single cell up to full sequence models.
The LSTM Cell
Each gate is a linear projection of the input $x_t$ and the previous hidden state $h_{t-1}$, followed by a nonlinearity. (Using separate weight matrices for $x_t$ and $h_{t-1}$ is equivalent to one matrix applied to their concatenation.)
# Compute gates (bias terms included, matching the initialization discussed below)
i = torch.sigmoid(W_xi @ x + W_hi @ h_prev + b_i)  # Input gate
f = torch.sigmoid(W_xf @ x + W_hf @ h_prev + b_f)  # Forget gate
g = torch.tanh(W_xc @ x + W_hc @ h_prev + b_c)     # Cell candidate
o = torch.sigmoid(W_xo @ x + W_ho @ h_prev + b_o)  # Output gate
# Cell state update (the gradient highway)
c_new = f * c_prev + i * g
# Hidden state update
h_new = o * torch.tanh(c_new)
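Wrapped in a module, a minimal self-contained version of the cell might look like the sketch below. The class name, the use of separate nn.Linear projections per gate, and the Wx*/Wh* attribute names are illustrative choices, not the exact implementation.

import torch
import torch.nn as nn

class LSTMCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        # One input-to-hidden and one hidden-to-hidden projection per gate
        self.Wxi, self.Whi = nn.Linear(input_size, hidden_size), nn.Linear(hidden_size, hidden_size)
        self.Wxf, self.Whf = nn.Linear(input_size, hidden_size), nn.Linear(hidden_size, hidden_size)
        self.Wxc, self.Whc = nn.Linear(input_size, hidden_size), nn.Linear(hidden_size, hidden_size)
        self.Wxo, self.Who = nn.Linear(input_size, hidden_size), nn.Linear(hidden_size, hidden_size)

    def forward(self, x, state):
        h_prev, c_prev = state
        i = torch.sigmoid(self.Wxi(x) + self.Whi(h_prev))  # Input gate
        f = torch.sigmoid(self.Wxf(x) + self.Whf(h_prev))  # Forget gate
        g = torch.tanh(self.Wxc(x) + self.Whc(h_prev))     # Cell candidate
        o = torch.sigmoid(self.Wxo(x) + self.Who(h_prev))  # Output gate
        c_new = f * c_prev + i * g        # Additive cell state update
        h_new = o * torch.tanh(c_new)     # Hidden state
        return h_new, c_new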
Weight Initialization
All weights use Xavier initialization, with one important detail: forget gate biases start at 1.0. Without this, the forget gate sigmoid outputs ~0.5 at initialization, immediately discarding half the cell state. Setting the bias to 1 pushes the initial output toward 1 (preserve everything), which stabilizes early training. Jozefowicz et al. (2015) showed this is critical for consistent convergence.
# Critical: Initialize forget gate bias to 1.0
nn.init.ones_(self.Wxf.bias)
nn.init.ones_(self.Whf.bias)
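Putting both pieces together, the full init routine for a cell built from separate nn.Linear layers could look roughly like this (the _init_weights name and the Wx*/Wh* attributes are assumptions carried over from the sketch above):

def _init_weights(self):
    # Xavier initialization for every weight matrix, zero biases by default
    for layer in (self.Wxi, self.Whi, self.Wxf, self.Whf,
                  self.Wxc, self.Whc, self.Wxo, self.Who):
        nn.init.xavier_uniform_(layer.weight)
        nn.init.zeros_(layer.bias)
    # Forget gate biases start at 1.0 so the gate initially preserves the cell state
    nn.init.ones_(self.Wxf.bias)
    nn.init.ones_(self.Whf.bias)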
Multi-Layer LSTMs
Stacking cells gives you hierarchical temporal processing:
- Lower layers pick up short-range patterns (local token interactions)
- Higher layers capture longer-range structure (sentence-level meaning)
- Dropout between layers regularizes the stack
Each layer's hidden state $h_t^{(l)}$ feeds into the layer above it. Dropout is applied between layers only, not after the final output -- matching PyTorch's built-in LSTM convention.
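One way to wire the stack, reusing the LSTMCell sketch from above (the layer loop and dropout placement follow the description; names are assumed):

class StackedLSTM(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, dropout=0.0):
        super().__init__()
        self.layers = nn.ModuleList(
            [LSTMCell(input_size if l == 0 else hidden_size, hidden_size)
             for l in range(num_layers)]
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, states):
        # x: input at one time step; states: list of (h, c) pairs, one per layer
        new_states = []
        for l, cell in enumerate(self.layers):
            h, c = cell(x, states[l])
            new_states.append((h, c))
            # Dropout between layers only, not after the top layer
            x = self.dropout(h) if l < len(self.layers) - 1 else h
        return x, new_states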
Architecture Variants
LSTM Classifier (Many-to-One)
For sequence classification (a sketch follows the list):
- Run the full sequence through the multi-layer LSTM
- Concatenate final hidden states from all layers
- Project to class logits through a linear layer
- Bidirectional mode doubles the representation
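A minimal many-to-one head, shown with the built-in nn.LSTM for brevity. The class and attribute names are assumptions, but the concatenate-all-final-states choice mirrors the description above:

class LSTMClassifier(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, num_classes, bidirectional=False):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers,
                            batch_first=True, bidirectional=bidirectional)
        num_directions = 2 if bidirectional else 1
        # Final hidden states from all layers (and both directions) are concatenated
        self.fc = nn.Linear(num_layers * num_directions * hidden_size, num_classes)

    def forward(self, x):
        _, (h_n, _) = self.lstm(x)   # h_n: (num_layers * num_directions, batch, hidden)
        h_cat = h_n.transpose(0, 1).reshape(x.size(0), -1)
        return self.fc(h_cat)        # Class logits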
LSTM Tagger (Many-to-Many)
For sequence labeling (POS tagging, NER), as sketched below:
- Emit a prediction at every time step
- Each $h_t$ is projected to the label vocabulary
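A corresponding many-to-many head, again sketched with nn.LSTM (names assumed):

class LSTMTagger(nn.Module):
    def __init__(self, input_size, hidden_size, num_labels):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_labels)

    def forward(self, x):
        out, _ = self.lstm(x)   # out: (batch, seq_len, hidden) -- one h_t per step
        return self.fc(out)     # (batch, seq_len, num_labels): a prediction at every time step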
CharLSTM
Character-level language model:
- Embedding layer maps character indices to dense vectors
- Multi-layer LSTM processes the sequence
- Generation uses temperature-controlled sampling
def generate(self, start_token, seq_len, temperature=1.0):
    ...
    # logits: model output for the sequence generated so far
    # Sample next character from the temperature-scaled distribution
    next_logits = logits[:, -1, :] / temperature
    probs = torch.softmax(next_logits, dim=-1)
    next_token = torch.multinomial(probs, 1)
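For context, the model around generate is just embedding, LSTM, and a projection back to the character vocabulary; a rough sketch (layer names assumed, not the exact code):

class CharLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_size, num_layers):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # char indices -> dense vectors
        self.lstm = nn.LSTM(embed_dim, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, tokens):
        out, _ = self.lstm(self.embed(tokens))
        return self.fc(out)   # Logits over the character vocabulary at each step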
Seq2SeqLSTM (Encoder-Decoder)
The encoder-decoder architecture (sketched below):
- Encoder LSTM compresses the source sequence into a fixed-size context vector (final hidden and cell states)
- Decoder LSTM initializes from the encoder's final states and generates the target autoregressively
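A bare-bones sketch of that handoff (module names and the use of nn.LSTM are assumptions):

class Seq2SeqLSTM(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.encoder = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.decoder = nn.LSTM(input_size, hidden_size, batch_first=True)

    def forward(self, src, tgt):
        # Encoder compresses the source into its final (h, c) -- the context vector
        _, (h, c) = self.encoder(src)
        # Decoder starts from the encoder's final states and unrolls over the target
        out, _ = self.decoder(tgt, (h, c))
        return out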
Bidirectional Processing
When the model needs full left-right context (sentiment analysis, NER), we run two LSTMs in opposite directions, as sketched after this list:
- Forward pass: $h_t^{\rightarrow}$
- Backward pass: $h_t^{\leftarrow}$
- Concatenation: $h_t = [h_t^{\rightarrow}; h_t^{\leftarrow}]$
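With the built-in module this is a single flag, and the forward/backward concatenation happens internally:

# Bidirectional LSTM: forward and backward passes, concatenated per time step
bilstm = nn.LSTM(input_size=64, hidden_size=128, batch_first=True, bidirectional=True)
x = torch.randn(8, 30, 64)   # (batch, seq_len, features)
out, _ = bilstm(x)
print(out.shape)             # torch.Size([8, 30, 256]) -- [h_forward; h_backward]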
Training Strategy
We test on a synthetic long-range dependency task designed to expose vanishing gradients (one possible construction is sketched after this list):
- Input: random sequence of length 30
- Task: classify based on the first + last element only
- The 28 intermediate values are noise -- the model must learn to ignore them while retaining the endpoints
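One plausible way to construct such a batch; the labeling rule below (sign agreement between the first and last elements) is a stand-in, not necessarily the rule used here:

def make_batch(batch_size, seq_len=30, device="cpu"):
    # Random sequences; only positions 0 and -1 carry signal
    x = torch.randn(batch_size, seq_len, 1, device=device)
    # Hypothetical rule: class 1 if the first and last elements have the same sign
    y = (x[:, 0, 0] * x[:, -1, 0] > 0).long()
    return x, y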
Gradient clipping handles the complementary exploding gradient problem:
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
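Clipping slots in between the backward pass and the optimizer step; a typical loop looks roughly like this, with model, criterion, optimizer, and loader as placeholders:

for x, y in loader:
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    # Rescale gradients in place so their global norm is at most 1.0
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()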
Next: Visualizing Gate Dynamics
The implementation is done. In Part 3, we train this model and look at what the gates actually learn -- extracting forget, input, and output gate activations across time to see how the network decides what to remember and what to discard.