Part 1 covered the math. Here we turn those equations into working PyTorch code, starting from a single RNN cell and building up to multi-layer sequence models with several architecture variants.
The RNN Cell
Our RNNCell class implements the core recurrence from Part 1 as a proper PyTorch module. Two linear layers handle the input-to-hidden and hidden-to-hidden projections, and a tanh squashes the result:
import torch
import torch.nn as nn

class RNNCell(nn.Module):
    def __init__(self, input_size: int, hidden_size: int, bias: bool = True):
        super().__init__()
        self.Wxh = nn.Linear(input_size, hidden_size, bias=bias)   # input-to-hidden projection
        self.Whh = nn.Linear(hidden_size, hidden_size, bias=bias)  # hidden-to-hidden projection
        self._init_weights()

    def _init_weights(self):
        nn.init.xavier_uniform_(self.Wxh.weight)
        nn.init.xavier_uniform_(self.Whh.weight)
        if self.Wxh.bias is not None:
            nn.init.zeros_(self.Wxh.bias)
            nn.init.zeros_(self.Whh.bias)

    def forward(self, x, h_prev):
        # h_t = tanh(W_xh x_t + W_hh h_{t-1})
        return torch.tanh(self.Wxh(x) + self.Whh(h_prev))
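A single forward step can be sanity-checked on random tensors; the sizes below simply mirror the setup used later in this series (input dimension 10, hidden size 64, batch size 32):

cell = RNNCell(input_size=10, hidden_size=64)
x_t = torch.randn(32, 10)     # one time step for a batch of 32 sequences
h_prev = torch.zeros(32, 64)  # previous hidden state; zeros at t = 0
h_t = cell(x_t, h_prev)       # shape (32, 64), values in (-1, 1) from the tanh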
Key design choices:
- Separate weight matrices for input and hidden projections -- this keeps the two signal sources decoupled and easier to reason about
- Xavier initialization to stabilize training by keeping variance roughly constant across layers
- Biases initialized to zero so the cell starts with no directional preference
Multi-Layer RNNs
Stacking cells vertically gives the network more capacity to learn hierarchical features:
- Layer 1 processes the raw input sequence
- Layer 2 processes the hidden states from Layer 1
- Each additional layer learns higher-level temporal patterns
The multi-layer RNN class stores cells in a ModuleList and loops through both time steps and layers. At each time step, the input passes through every layer, with the output of each layer becoming the input to the next:
for t in range(seq_len):
    x_t = x[:, t, :]  # input at time t, shape (batch, input_size)
    for layer in range(self.num_layers):
        hidden_states[layer] = self.cells[layer](x_t, hidden_states[layer])
        x_t = hidden_states[layer]
        if self.dropout is not None and layer < self.num_layers - 1:
            x_t = self.dropout(x_t)
    outputs.append(hidden_states[-1].unsqueeze(1))
Notice that dropout is applied between layers, never across time steps within a layer. Dropping activations across time would break the recurrence and corrupt the hidden state trajectory.
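For reference, here is a minimal sketch of the module that wraps this loop. The name RNN, the self.cells ModuleList, and the (outputs, h_final) return convention follow the snippets in this post; everything else (argument names, the optional h0 parameter) is an assumption rather than the exact implementation:

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers=1, dropout=0.0):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        # Layer 0 reads the raw input; every deeper layer reads hidden states.
        self.cells = nn.ModuleList(
            [RNNCell(input_size if i == 0 else hidden_size, hidden_size)
             for i in range(num_layers)]
        )
        self.dropout = nn.Dropout(dropout) if dropout > 0 else None

    def forward(self, x, h0=None):
        batch_size, seq_len, _ = x.shape
        # One hidden state per layer, zero-initialized unless the caller provides h0
        hidden_states = ([h for h in h0] if h0 is not None
                         else [x.new_zeros(batch_size, self.hidden_size)
                               for _ in range(self.num_layers)])
        outputs = []
        for t in range(seq_len):
            x_t = x[:, t, :]
            for layer in range(self.num_layers):
                hidden_states[layer] = self.cells[layer](x_t, hidden_states[layer])
                x_t = hidden_states[layer]
                if self.dropout is not None and layer < self.num_layers - 1:
                    x_t = self.dropout(x_t)
            outputs.append(hidden_states[-1].unsqueeze(1))
        # outputs: (batch, seq_len, hidden_size) from the top layer
        # h_final: (num_layers, batch, hidden_size)
        return torch.cat(outputs, dim=1), torch.stack(hidden_states)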
Architecture Variants
Sequence Classifier
Many-to-one architecture. The RNN reads the entire sequence and the final hidden state -- which is a compressed summary of the full input -- gets projected through a linear layer to produce class logits:
class SequenceClassifier(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes,
                 num_layers=2, dropout=0.2):
        super().__init__()
        self.rnn = RNN(input_size, hidden_size, num_layers, dropout=dropout)
        self.fc = nn.Linear(hidden_size, num_classes)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        outputs, h_final = self.rnn(x)
        h_final = self.dropout(h_final[-1])  # hidden state of the last layer
        return self.fc(h_final)
This is the architecture we train in Part 3 -- a 2-layer RNN with hidden size 64, totaling 13,314 parameters.
Sequence Tagger
Many-to-many architecture. Instead of collapsing to a single output, the tagger keeps every hidden state and projects each one independently. The input shape is (batch, seq_len, input_size) and the output is (batch, seq_len, num_tags) -- one prediction per time step. This is the setup you would use for tasks like part-of-speech tagging or named entity recognition.
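A minimal sketch of that many-to-many pattern, reusing the RNN module sketched above (the class and argument names here are illustrative, not the exact implementation):

class SequenceTagger(nn.Module):
    def __init__(self, input_size, hidden_size, num_tags, num_layers=2, dropout=0.2):
        super().__init__()
        self.rnn = RNN(input_size, hidden_size, num_layers, dropout=dropout)
        self.fc = nn.Linear(hidden_size, num_tags)

    def forward(self, x):
        outputs, _ = self.rnn(x)  # (batch, seq_len, hidden_size) -- every hidden state is kept
        return self.fc(outputs)   # (batch, seq_len, num_tags) -- one prediction per time step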
CharRNN
Character-level language model. An embedding layer maps character indices to dense vectors, a 3-layer RNN processes the embedded sequence, and a final linear layer projects back to vocabulary logits. Generation uses temperature-scaled sampling in an autoregressive loop:
next_logits = logits[:, -1, :] / temperature
probs = F.softmax(next_logits, dim=-1)
next_token = torch.multinomial(probs, 1)
Higher temperature flattens the probability distribution and produces more varied (but less coherent) output. Lower temperature sharpens it toward greedy decoding.
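Wrapped into a full autoregressive loop, the sampling step looks roughly like this. This is a sketch: it assumes the model takes a (batch, seq_len) tensor of character indices and returns (batch, seq_len, vocab_size) logits, and the generate function itself is hypothetical:

import torch.nn.functional as F

@torch.no_grad()
def generate(model, start_tokens, max_new_tokens=200, temperature=1.0):
    model.eval()
    tokens = start_tokens  # (1, prompt_len) tensor of character indices
    for _ in range(max_new_tokens):
        logits = model(tokens)                    # (1, seq_len, vocab_size)
        next_logits = logits[:, -1, :] / temperature
        probs = F.softmax(next_logits, dim=-1)
        next_token = torch.multinomial(probs, 1)  # sample one character index
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens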
Bidirectional RNN
Two separate RNN instances process the sequence in opposite directions. The backward RNN receives the input flipped along the time axis. After both passes complete, the forward and backward hidden states at each position are concatenated, doubling the representation size:
fwd_outputs, _ = self.forward_rnn(x)
bwd_outputs, _ = self.backward_rnn(x.flip(dims=[1]))
bwd_outputs = bwd_outputs.flip(dims=[1])
outputs = torch.cat([fwd_outputs, bwd_outputs], dim=-1)
This gives each position access to context from both the past and the future, which is essential for tasks like NER where the label for a word depends on the words around it in both directions.
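Wrapped into a module (again a sketch with assumed names, built from the two forward passes shown above):

class BidirectionalRNN(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers=1, dropout=0.0):
        super().__init__()
        self.forward_rnn = RNN(input_size, hidden_size, num_layers, dropout=dropout)
        self.backward_rnn = RNN(input_size, hidden_size, num_layers, dropout=dropout)

    def forward(self, x):
        fwd_outputs, _ = self.forward_rnn(x)
        bwd_outputs, _ = self.backward_rnn(x.flip(dims=[1]))   # read the sequence right-to-left
        bwd_outputs = bwd_outputs.flip(dims=[1])                # re-align to the original time order
        return torch.cat([fwd_outputs, bwd_outputs], dim=-1)    # (batch, seq_len, 2 * hidden_size)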
Implementation Details
Batch-First vs Time-First
We use batch-first tensors (B, T, D) for compatibility with PyTorch conventions, but internally process time step by time step. If the caller passes time-first input, the forward method transposes it before the main loop and transposes back before returning.
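As a concrete illustration of the two layouts (the rnn instance and sizes here are assumptions matching the sketches above):

rnn = RNN(input_size=10, hidden_size=64, num_layers=2)

x_time_first = torch.randn(20, 32, 10)        # (seq_len, batch, input_size)
x_batch_first = x_time_first.transpose(0, 1)  # (batch, seq_len, input_size) -- the internal format
outputs, h_final = rnn(x_batch_first)         # outputs: (32, 20, 64)
outputs_time_first = outputs.transpose(0, 1)  # back to (seq_len, batch, hidden_size) for the caller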
Hidden State Initialization
Hidden states are initialized to zero at the start of each sequence: $h_0 = \mathbf{0}$. This is the standard convention. The caller can also pass in a custom initial state, which is useful for continuing generation from a previous context or for warm-starting the hidden state in streaming applications.
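In code, carrying state across chunks might look like this (a sketch; the h0 argument and the stacked h_final return value come from the RNN sketch above):

rnn = RNN(input_size=10, hidden_size=64, num_layers=2)
chunk_1 = torch.randn(4, 20, 10)
chunk_2 = torch.randn(4, 20, 10)

_, h_prev = rnn(chunk_1)             # defaults to h_0 = 0 for a fresh sequence
_, h_next = rnn(chunk_2, h0=h_prev)  # warm start: continue from the previous final state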
Training Strategy
We train the SequenceClassifier on a synthetic binary classification task. The data generation is simple: random Gaussian sequences of length 20 with input dimension 10, labeled by whether the mean value across the sequence is positive or negative. The training setup uses:
- 5,000 training samples and 1,000 test samples
- Cross-entropy loss with Adam optimizer (lr = 0.001)
- Step learning rate scheduler: halves the rate every 20 epochs
- Mini-batches of size 32
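Put together, the setup might look like the sketch below. It interprets 'mean value across the sequence' as the mean over all time steps and features, and the epoch count is illustrative; the exact training script is the subject of Part 3:

def make_data(n, seq_len=20, input_size=10):
    x = torch.randn(n, seq_len, input_size)
    y = (x.mean(dim=(1, 2)) > 0).long()  # 1 if the mean over the whole sequence is positive
    return x, y

x_train, y_train = make_data(5_000)
x_test, y_test = make_data(1_000)

model = SequenceClassifier(input_size=10, hidden_size=64, num_classes=2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)  # halve every 20 epochs

loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(x_train, y_train), batch_size=32, shuffle=True)

for epoch in range(60):  # number of epochs is illustrative
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
    scheduler.step()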
What Comes Next
The implementation is done. In Part 3, we train this model on a synthetic classification task and look at what the hidden states actually do during inference.