Transformers from Scratch: Part 3 - Training and Visualizing

Introduction

Welcome to the grand finale of our 3-part series on building a Transformer from first principles!

In Part 1, we deconstructed the mathematics behind Multi-Head Self-Attention. In Part 2, we wrote a pure PyTorch implementation of the full Sequence-to-Sequence architecture, navigating the tricky landscape of causality masks and parallel batching.

However, a model implementation is only as good as its ability to learn. Today, we will define a training objective, execute the training loop, and extract the generated attention maps to prove our Transformer learns to route information dynamically.

The Training Task: Sequence Reversal

To benchmark our model, we selected a toy Sequence-to-Sequence task: integer sequence reversal.

The task is simple but incredibly effective at evaluating attention mechanisms: given a sequence of integers like <BOS> 4 8 2 9 <EOS>, the model must output <BOS> 9 2 8 4 <EOS>.

We generated an on-the-fly dataset of $10,000$ random short sequences. We then trained the full Transformer using an Adam optimizer and Cross-Entropy Loss to enforce accurate token prediction over an abstract vocabulary of size 13 (covering padding, start, end, and digits).

The Loss Curve

Running the custom train.py script yields an expected training trajectory:

Figure 1: Training Loss Curve showing rapid convergence for Sequence Reversal

The loss plummets extremely quickly. Transformers converge fast on perfectly deterministic symbolic tasks. The model first learns the positional token distribution and grammar, rapidly decreasing to near-zero as it masters the exact mapping.

Staring into the Attention Matrix

The true power of this project is our ability to peer inside the "black box." How is the decoder making its predictions?

During inference, we intercepted the cross-attention tensors inside the final DecoderLayer. We extracted an attention map of dimension (Target_Length, Source_Length):

Cross-Attention Map — Figure 2: Learned Cross-Attention map displaying anti-diagonal probability distributions.

In the heatmap above, you observe a stark, crisp anti-diagonal line tracing from the top right to the bottom left.

What does this mean? It signifies that when the decoder generates the first vocabulary token, its query vector achieves maximum activation strictly with the key vector of the last sequence token from the encoder. The Transformer learned that to reverse a sequence, it must route information sequentially backward purely by altering attention weights across positions.

Conclusion

Through these 3 posts, we have demystified one of the most vital architectures in modern machine learning. By coding the math without abstraction layers, we understand exactly how LLMs read and write context.

Stay tuned for the next series!