Part 1 laid out the math of Self-Attention and Positional Encodings. Now we turn those equations into running code.
We skip nn.Transformer and build the full sequence-to-sequence model from scratch in PyTorch -- managing tensor dimensions, multi-head reshapes, and causal masks by hand.
Implementing Scaled Dot-Product Attention
The attention mechanism requires projecting our inputs, performing batched matrix multiplications, and applying masking. Here is the core module:
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledDotProductAttention(nn.Module):
    def forward(self, q, k, v, mask=None):
        d_k = q.size(-1)
        # Scale by sqrt(d_k) to keep the softmax in a well-behaved range
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
        if mask is not None:
            # Masked positions get a large negative score so softmax drives them to ~0
            scores = scores.masked_fill(mask == 0, -1e9)
        attention_weights = F.softmax(scores, dim=-1)
        output = torch.matmul(attention_weights, v)
        return output, attention_weights
Notice the use of -1e9 when applying the mask. After the softmax, those positions receive weights that are effectively zero, ensuring the model cannot attend to padded or future tokens. We return both the output and the raw attention weights -- the weights become critical in Part 3 when we extract and visualize the cross-attention maps.
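To see that in isolation, here is a tiny standalone check of the same masked_fill / softmax pattern; the numbers are illustrative and not part of the model:

import torch
import torch.nn.functional as F

scores = torch.tensor([[2.0, 1.0, 3.0]])
mask = torch.tensor([[1, 1, 0]])               # the last position is masked out
masked = scores.masked_fill(mask == 0, -1e9)
print(F.softmax(masked, dim=-1))               # approx. tensor([[0.7311, 0.2689, 0.0000]])

The masked position ends up with zero weight, and the remaining probability mass is redistributed over the visible positions.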
Multi-Head Attention: The Reshape Trick
The multi-head mechanism does not literally instantiate separate attention modules. Instead, we project once into the full $d_{model}$ space, then reshape the tensor to split it across heads. This is where the dimension management gets precise:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
        self.attention = ScaledDotProductAttention()

    def forward(self, q, k, v, mask=None):
        batch_size = q.size(0)
        # Project and reshape: (batch, seq, d_model) -> (batch, heads, seq, d_k)
        q = self.W_q(q).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        k = self.W_k(k).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        v = self.W_v(v).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        out, weights = self.attention(q, k, v, mask)
        # Concatenate heads: (batch, heads, seq, d_k) -> (batch, seq, d_model)
        out = out.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        return self.W_o(out), weights
The .view() and .transpose() calls are doing all the heavy lifting. We reshape from (batch, seq_len, d_model) to (batch, num_heads, seq_len, d_k), run attention in parallel across all heads, then concatenate the results back. With d_model = 64 and num_heads = 4, each head operates in a 16-dimensional subspace. The final W_o projection recombines these subspaces into a single representation.
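A quick shape check makes the bookkeeping concrete. This sketch assumes the two modules above and the example sizes just mentioned; the batch size and sequence length are arbitrary:

import torch

mha = MultiHeadAttention(d_model=64, num_heads=4)
x = torch.randn(2, 10, 64)        # (batch=2, seq_len=10, d_model=64)
out, weights = mha(x, x, x)       # self-attention: q = k = v
print(out.shape)                  # torch.Size([2, 10, 64])
print(weights.shape)              # torch.Size([2, 4, 10, 10]) -- one 10x10 attention map per head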
Constructing the Encoder and Decoder Layers
The Transformer architecture is highly modular, built from a small set of repeated building blocks that appear in both the encoder and the decoder.
The Encoder Layer consists of:
- Multi-Head Self-Attention
- Residual Connection & Layer Normalization
- Position-wise Feed-Forward Network
- Residual Connection & Layer Normalization
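Here is a minimal sketch of how those four pieces fit together. The post-norm ordering and the dropout placement are one common choice and are illustrative, not necessarily the exact configuration used in this series:

class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, src_mask=None):
        # Sub-layer 1: self-attention, then residual connection + layer norm
        attn_out, _ = self.self_attn(x, x, x, src_mask)
        x = self.norm1(x + self.dropout(attn_out))
        # Sub-layer 2: position-wise feed-forward, then residual + layer norm
        ffn_out = self.ffn(x)
        x = self.norm2(x + self.dropout(ffn_out))
        return x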
The residual connections are essential. Without them, gradients must flow through the attention and feed-forward blocks sequentially, and deep stacks of layers would suffer from vanishing gradients. By adding the input directly to the output of each sub-layer, the gradient has a shortcut path back through the network. Layer normalization then stabilizes the scale of activations at each layer.
The Decoder Layer is slightly more complex. It adds a Cross-Attention block in the middle, which acts as the bridge between the encoder and the decoder. In cross-attention, the Queries come from the previous Decoder layer, while the Keys and Values are provided by the Encoder's final output. This asymmetry is what allows the decoder to "read" the encoded input while generating its own sequence.
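The same pattern extends to a decoder layer. The sketch below, under the same assumptions as the encoder sketch, highlights the wiring that matters: self-attention uses the target mask, while cross-attention takes its Keys and Values from the encoder output. Returning the cross-attention weights here is an illustrative choice that anticipates Part 3:

class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_out, tgt_mask=None, src_mask=None):
        # Masked self-attention over the target sequence
        attn_out, _ = self.self_attn(x, x, x, tgt_mask)
        x = self.norm1(x + self.dropout(attn_out))
        # Cross-attention: Queries from the decoder, Keys/Values from the encoder output
        cross_out, cross_weights = self.cross_attn(x, enc_out, enc_out, src_mask)
        x = self.norm2(x + self.dropout(cross_out))
        # Position-wise feed-forward
        ffn_out = self.ffn(x)
        x = self.norm3(x + self.dropout(ffn_out))
        return x, cross_weights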
The Future-Blind Mask
One of the trickiest parts of the implementation is the Look-Ahead Mask (or causal mask) in the decoder. With teacher forcing, the entire target sequence is fed to the decoder at once during training, so the model must be explicitly prevented from seeing future tokens. We achieve this by applying a lower-triangular boolean mask to the self-attention scores in the decoder.
tgt_lookahead_mask = torch.tril(torch.ones((tgt_len, tgt_len))).bool()
For a target sequence of length 5, this produces a 5x5 matrix where position 0 can only see position 0, position 1 can see positions 0-1, and so on. Combined with the padding mask (which zeros out attention to <PAD> tokens), this ensures the decoder respects both causality and sequence boundaries during training. Getting these two masks to broadcast correctly across the batch and head dimensions was one of the more subtle debugging challenges in the implementation.
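For reference, one way to build and combine the two masks so they broadcast cleanly is sketched below. The pad_idx value and the helper name are illustrative assumptions; the point is the shapes:

def make_tgt_mask(tgt, pad_idx=0):
    # Padding mask: (batch, 1, 1, tgt_len) -- True where the token is not <PAD>
    pad_mask = (tgt != pad_idx).unsqueeze(1).unsqueeze(2)
    # Look-ahead mask: (tgt_len, tgt_len) -- True on and below the diagonal
    tgt_len = tgt.size(1)
    lookahead = torch.tril(torch.ones((tgt_len, tgt_len), dtype=torch.bool, device=tgt.device))
    # Broadcasting combines them into (batch, 1, tgt_len, tgt_len); the size-1 head
    # dimension then broadcasts against the (batch, heads, tgt_len, tgt_len) score tensor.
    return pad_mask & lookahead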
What's Next
The model compiles and runs -- all 169,933 parameters -- but does it actually learn? Part 3 trains it on a sequence-reversal task and cracks open the cross-attention tensors to see how the model routes information.