Part 1 covered the math. Here we turn those equations into working PyTorch code, starting from a single RNN cell and building up to multi-layer sequence models with several architecture variants.
The RNN Cell
Our RNNCell class implements the core recurrence from Part 1 as a proper PyTorch module. Two linear layers handle the input-to-hidden and hidden-to-hidden projections, and a tanh squashes the result:
import torch
import torch.nn as nn

class RNNCell(nn.Module):
    def __init__(self, input_size: int, hidden_size: int, bias: bool = True):
        super().__init__()
        self.Wxh = nn.Linear(input_size, hidden_size, bias=bias)   # input-to-hidden projection
        self.Whh = nn.Linear(hidden_size, hidden_size, bias=bias)  # hidden-to-hidden projection
        self._init_weights()

    def _init_weights(self):
        nn.init.xavier_uniform_(self.Wxh.weight)
        nn.init.xavier_uniform_(self.Whh.weight)
        if self.Wxh.bias is not None:
            nn.init.zeros_(self.Wxh.bias)
            nn.init.zeros_(self.Whh.bias)

    def forward(self, x, h_prev):
        # h_t = tanh(W_xh x_t + W_hh h_{t-1})
        return torch.tanh(self.Wxh(x) + self.Whh(h_prev))
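A single forward step can be sanity-checked on random tensors; the sizes below simply mirror the setup used later in this series (input dimension 10, hidden size 64, batch size 32):

cell = RNNCell(input_size=10, hidden_size=64)
x_t = torch.randn(32, 10)     # one time step for a batch of 32 sequences
h_prev = torch.zeros(32, 64)  # previous hidden state; zeros at t = 0
h_t = cell(x_t, h_prev)       # shape (32, 64), values in (-1, 1) from the tanh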
Key design choices:
- Separate weight matrices for input and hidden projections -- this keeps the two signal sources decoupled and easier to reason about
- Xavier initialization to stabilize training by keeping variance roughly constant across layers
- Biases initialized to zero so the cell starts with no directional preference
Multi-Layer RNNs
Stacking cells vertically gives the network more capacity to learn hierarchical features:
- Layer 1 processes the raw input sequence
- Layer 2 processes the hidden states from Layer 1
- Each additional layer learns higher-level temporal patterns
The multi-layer RNN class stores cells in a ModuleList and loops through both time steps and layers. At each time step, the input passes through every layer, with the output of each layer becoming the input to the next:
for t in range(seq_len):
    x_t = x[:, t, :]  # input at time t, shape (batch, input_size)
    for layer in range(self.num_layers):
        hidden_states[layer] = self.cells[layer](x_t, hidden_states[layer])
        x_t = hidden_states[layer]
        if self.dropout is not None and layer < self.num_layers - 1:
            x_t = self.dropout(x_t)
    outputs.append(hidden_states[-1].unsqueeze(1))
Notice that dropout is applied between layers, never across time steps within a layer. Dropping activations across time would break the recurrence and corrupt the hidden state trajectory.
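For reference, here is a minimal sketch of the module that wraps this loop. The name RNN, the self.cells ModuleList, and the (outputs, h_final) return convention follow the snippets in this post; everything else (argument names, the optional h0 parameter) is an assumption rather than the exact implementation:

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers=1, dropout=0.0):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        # Layer 0 reads the raw input; every deeper layer reads hidden states.
        self.cells = nn.ModuleList(
            [RNNCell(input_size if i == 0 else hidden_size, hidden_size)
             for i in range(num_layers)]
        )
        self.dropout = nn.Dropout(dropout) if dropout > 0 else None

    def forward(self, x, h0=None):
        batch_size, seq_len, _ = x.shape
        # One hidden state per layer, zero-initialized unless the caller provides h0
        hidden_states = ([h for h in h0] if h0 is not None
                         else [x.new_zeros(batch_size, self.hidden_size)
                               for _ in range(self.num_layers)])
        outputs = []
        for t in range(seq_len):
            x_t = x[:, t, :]
            for layer in range(self.num_layers):
                hidden_states[layer] = self.cells[layer](x_t, hidden_states[layer])
                x_t = hidden_states[layer]
                if self.dropout is not None and layer < self.num_layers - 1:
                    x_t = self.dropout(x_t)
            outputs.append(hidden_states[-1].unsqueeze(1))
        # outputs: (batch, seq_len, hidden_size) from the top layer
        # h_final: (num_layers, batch, hidden_size)
        return torch.cat(outputs, dim=1), torch.stack(hidden_states)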
Architecture Variants
Sequence Classifier
Many-to-one architecture. The RNN reads the entire sequence and the final hidden state -- which is a compressed summary of the full input -- gets projected through a linear layer to produce class logits:
class SequenceClassifier(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes,
                 num_layers=2, dropout=0.2):
        super().__init__()
        self.rnn = RNN(input_size, hidden_size, num_layers, dropout=dropout)
        self.fc = nn.Linear(hidden_size, num_classes)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        outputs, h_final = self.rnn(x)
        h_final = self.dropout(h_final[-1])  # hidden state of the last layer
        return self.fc(h_final)
This is the architecture we train in Part 3 -- a 2-layer RNN with hidden size 64, totaling 13,314 parameters.
Sequence Tagger
Many-to-many architecture. Instead of collapsing to a single output, the tagger keeps every hidden state and projects each one independently. The input shape is (batch, seq_len, input_size) and the output is (batch, seq_len, num_tags) -- one prediction per time step. This is the setup you would use for tasks like part-of-speech tagging or named entity recognition.
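A minimal sketch of that many-to-many pattern, reusing the RNN module sketched above (the class and argument names here are illustrative, not the exact implementation):

class SequenceTagger(nn.Module):
    def __init__(self, input_size, hidden_size, num_tags, num_layers=2, dropout=0.2):
        super().__init__()
        self.rnn = RNN(input_size, hidden_size, num_layers, dropout=dropout)
        self.fc = nn.Linear(hidden_size, num_tags)

    def forward(self, x):
        outputs, _ = self.rnn(x)  # (batch, seq_len, hidden_size) -- every hidden state is kept
        return self.fc(outputs)   # (batch, seq_len, num_tags) -- one prediction per time step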
CharRNN
Character-level language model. An embedding layer maps character indices to dense vectors, a 3-layer RNN processes the embedded sequence, and a final linear layer projects back to vocabulary logits. Generation uses temperature-scaled sampling in an autoregressive loop:
next_logits = logits[:, -1, :] / temperature
probs = F.softmax(next_logits, dim=-1)
next_token = torch.multinomial(probs, 1)
Higher temperature flattens the probability distribution and produces more varied (but less coherent) output. Lower temperature sharpens it toward greedy decoding.
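Wrapped into a full autoregressive loop, the sampling step looks roughly like this. This is a sketch: it assumes the model takes a (batch, seq_len) tensor of character indices and returns (batch, seq_len, vocab_size) logits, and the generate function itself is hypothetical:

import torch.nn.functional as F

@torch.no_grad()
def generate(model, start_tokens, max_new_tokens=200, temperature=1.0):
    model.eval()
    tokens = start_tokens  # (1, prompt_len) tensor of character indices
    for _ in range(max_new_tokens):
        logits = model(tokens)                    # (1, seq_len, vocab_size)
        next_logits = logits[:, -1, :] / temperature
        probs = F.softmax(next_logits, dim=-1)
        next_token = torch.multinomial(probs, 1)  # sample one character index
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens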
Bidirectional RNN
Two separate RNN instances process the sequence in opposite directions. The backward RNN receives the input flipped along the time axis. After both passes complete, the forward and backward hidden states at each position are concatenated, doubling the representation size:
fwd_outputs, _ = self.forward_rnn(x)
bwd_outputs, _ = self.backward_rnn(x.flip(dims=[1]))
bwd_outputs = bwd_outputs.flip(dims=[1])
outputs = torch.cat([fwd_outputs, bwd_outputs], dim=-1)
This gives each position access to context from both the past and the future, which is essential for tasks like NER where the label for a word depends on the words around it in both directions.
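Wrapped into a module (again a sketch with assumed names, built from the two forward passes shown above):

class BidirectionalRNN(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers=1, dropout=0.0):
        super().__init__()
        self.forward_rnn = RNN(input_size, hidden_size, num_layers, dropout=dropout)
        self.backward_rnn = RNN(input_size, hidden_size, num_layers, dropout=dropout)

    def forward(self, x):
        fwd_outputs, _ = self.forward_rnn(x)
        bwd_outputs, _ = self.backward_rnn(x.flip(dims=[1]))   # read the sequence right-to-left
        bwd_outputs = bwd_outputs.flip(dims=[1])                # re-align to the original time order
        return torch.cat([fwd_outputs, bwd_outputs], dim=-1)    # (batch, seq_len, 2 * hidden_size)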
Implementation Details
Batch-First vs Time-First
We use batch-first tensors (B, T, D) for compatibility with PyTorch conventions, but internally process time step by time step. If the caller passes time-first input, the forward method transposes it before the main loop and transposes back before returning.
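As a concrete illustration of the two layouts (the rnn instance and sizes here are assumptions matching the sketches above):

rnn = RNN(input_size=10, hidden_size=64, num_layers=2)

x_time_first = torch.randn(20, 32, 10)        # (seq_len, batch, input_size)
x_batch_first = x_time_first.transpose(0, 1)  # (batch, seq_len, input_size) -- the internal format
outputs, h_final = rnn(x_batch_first)         # outputs: (32, 20, 64)
outputs_time_first = outputs.transpose(0, 1)  # back to (seq_len, batch, hidden_size) for the caller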
Hidden State Initialization
Hidden states are initialized to zero at the start of each sequence: $h_0 = \mathbf{0}$. This is the standard convention. The caller can also pass in a custom initial state, which is useful for continuing generation from a previous context or for warm-starting the hidden state in streaming applications.
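In code, carrying state across chunks might look like this (a sketch; the h0 argument and the stacked h_final return value come from the RNN sketch above):

rnn = RNN(input_size=10, hidden_size=64, num_layers=2)
chunk_1 = torch.randn(4, 20, 10)
chunk_2 = torch.randn(4, 20, 10)

_, h_prev = rnn(chunk_1)             # defaults to h_0 = 0 for a fresh sequence
_, h_next = rnn(chunk_2, h0=h_prev)  # warm start: continue from the previous final state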
Training Strategy
We train the SequenceClassifier on a synthetic binary classification task. The data generation is simple: random Gaussian sequences of length 20 with input dimension 10, labeled by whether the mean value across the sequence is positive or negative. The training setup uses:
- 5,000 training samples and 1,000 test samples
- Cross-entropy loss with Adam optimizer (lr = 0.001)
- Step learning rate scheduler: halves the rate every 20 epochs
- Mini-batches of size 32
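Put together, the setup might look like the sketch below. It interprets 'mean value across the sequence' as the mean over all time steps and features, and the epoch count is illustrative; the exact training script is the subject of Part 3:

def make_data(n, seq_len=20, input_size=10):
    x = torch.randn(n, seq_len, input_size)
    y = (x.mean(dim=(1, 2)) > 0).long()  # 1 if the mean over the whole sequence is positive
    return x, y

x_train, y_train = make_data(5_000)
x_test, y_test = make_data(1_000)

model = SequenceClassifier(input_size=10, hidden_size=64, num_classes=2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)  # halve every 20 epochs

loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(x_train, y_train), batch_size=32, shuffle=True)

for epoch in range(60):  # number of epochs is illustrative
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
    scheduler.step()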
What Comes Next
The implementation is done. In Part 3, we train this model on a synthetic classification task and look at what the hidden states actually do during inference.