Deconstructing TinyGPT: Part 3 - Learning Curriculum and Attention Inspection

Setup

Task. Given an 8-character prompt ABC+DEF=, emit a 4-character reversed sum.
Model. TinyGPT with vocab=12, block=12, 4 layers, 4 heads, n_embd=64. Total trainable parameters (with weight tying): 200,832.
Training. 10,000 synthetic examples, AdamW lr=5e-4 with cosine schedule, 50 epochs, batch 128.
Evaluation. 1,000 held-out examples; exact-match accuracy plus per-position accuracy.
Hardware. Apple M-series MPS backend.
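
To make the task format concrete, here is a minimal sketch of how one example could be built. The helper names (`make_example`, `encode`), the zero-padding of operands, and the ordering of the 12 vocabulary symbols are assumptions for illustration, not necessarily what the repository does.

```python
# Hypothetical encoding of one training example (names and vocab order are assumptions).
VOCAB = "0123456789+="  # 12 symbols, consistent with vocab=12


def make_example(a: int, b: int) -> tuple[str, str]:
    """Return (prompt, target) for operands in the range 0..999."""
    prompt = f"{a:03d}+{b:03d}="    # 8 characters: ABC+DEF=
    target = f"{a + b:04d}"[::-1]   # 4 digits, units digit first
    return prompt, target


def encode(text: str) -> list[int]:
    """Map each character to its index in VOCAB."""
    return [VOCAB.index(ch) for ch in text]


prompt, target = make_example(123, 456)
print(prompt, target)                 # 123+456= 9750
print(len(encode(prompt + target)))   # 12
```

Prompt plus answer is 12 tokens long, which is consistent with block=12.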

Final Test Results

Metric	Value
Exact-match accuracy (full 4-digit answer)	1000/1000 = 100.0%
Position 0 (units, no carry-in)	100.0%
Position 1 (tens, carry from units)	100.0%
Position 2 (hundreds, carry from tens)	100.0%
Position 3 (thousands, carry from hundreds)	100.0%
Training time	98.1 s on MPS
Final training loss	0.0012
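
The two accuracy metrics in the table can be computed along the following lines. This is a sketch that assumes predictions and targets are equal-length digit strings in reversed order; the function name `score` and the toy data are illustrative only.

```python
def score(predictions: list[str], targets: list[str]) -> tuple[float, list[float]]:
    """Return (exact-match accuracy, per-position accuracy)."""
    n = len(targets)
    width = len(targets[0])
    # Exact match: the whole 4-character answer must be right.
    exact = sum(p == t for p, t in zip(predictions, targets)) / n
    # Per-position: compare character i of each prediction with character i of its target.
    per_pos = [
        sum(p[i] == t[i] for p, t in zip(predictions, targets)) / n
        for i in range(width)
    ]
    return exact, per_pos


# Toy data: the second prediction is wrong only in position 3.
preds = ["9750", "8991", "0001"]
golds = ["9750", "8990", "0001"]
exact, per_pos = score(preds, golds)
print(exact, per_pos)
```

Exact match is the stricter metric: a single wrong digit fails the whole example, which is why it lags the per-position numbers during training.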

A 200,832-parameter model trained from scratch in 98 seconds solves three-digit addition perfectly on the held-out test set. The interesting story is not the final number — it is how the model got there.

The Learning Curriculum

Per-epoch per-position accuracy on the validation set reveals the order in which the model masters each digit:

Epoch	Loss	Exact	pos 0 (units)	pos 1 (tens)	pos 2 (hund.)	pos 3 (thou.)
1	1.979	0.003	0.106	0.104	0.093	0.898
7	1.624	0.003	0.088	0.098	0.454	0.961
12	0.994	0.084	0.834	0.123	0.561	0.945
15	0.203	0.907	0.999	0.963	0.936	0.997
19	0.018	1.000	1.000	1.000	1.000	1.000

The thousands digit is learned first

Position 3 is already at $90\%$ after epoch 1 and reaches $96\%$ by epoch 7, while the units and tens positions are still near chance ($\approx 10\%$) and the hundreds position is only at $45\%$. This is not because thousands is mathematically easy; it is because in the training distribution it is always $0$ or $1$ (since $A, B < 1000 \Rightarrow A+B < 2000$). The model learns the marginal distribution of position 3 before it learns the function from $(A, B)$ to the digit. This is the same phenomenon as language models learning that "the" is the most likely next token before they learn syntax.
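
One way to check how much of an early accuracy figure is explained by marginals alone is to compare it against a majority-class baseline per position. The sketch below assumes targets are reversed 4-digit strings; the helper name and the toy data are illustrative. The baseline on the real training set depends on how operands were sampled, which is not specified here.

```python
from collections import Counter


def majority_baseline(targets: list[str]) -> list[float]:
    """Per-position accuracy of always predicting the most common digit."""
    n = len(targets)
    width = len(targets[0])
    baseline = []
    for i in range(width):
        counts = Counter(t[i] for t in targets)
        # Fraction of examples whose digit at position i is the most frequent one.
        baseline.append(counts.most_common(1)[0][1] / n)
    return baseline


# Toy targets only; run this on the actual training targets for a real baseline.
toy_targets = ["9750", "8991", "0001", "5320"]
print(majority_baseline(toy_targets))  # [0.25, 0.25, 0.25, 0.5]
```

A position whose accuracy sits at its majority baseline has learned the marginal but nothing about the operands yet.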

The tens digit is learned last

Position 1 is the slowest to converge — still at $30\%$ at epoch 13 when the units digit is already at $99\%$. The reason: the tens digit depends on the carry generated by the units of $A$ and $B$, which the model must compute internally. Among the four output positions, this is the one with the longest internal-dependency chain that is also non-marginal. By epoch 17 the model has consolidated this internal computation; by epoch 19 it is solved.

The curriculum reveals that the model is not memorising lookup tables — it is constructing internal carry-propagation circuitry, in the order the dependency structure imposes.
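
The dependency structure referred to here is that of ordinary column-wise addition. The following sketch makes the carry entering each output position explicit; it describes the arithmetic, not what the trained network computes internally, and the function name is illustrative.

```python
def add_with_carries(a: int, b: int, width: int = 4) -> tuple[list[int], list[int]]:
    """Return (digits, carries_in), both ordered units first."""
    digits, carries_in = [], []
    carry = 0
    for i in range(width):
        carries_in.append(carry)  # carry entering column i
        column = (a // 10**i) % 10 + (b // 10**i) % 10 + carry
        digits.append(column % 10)
        carry = column // 10      # carry leaving column i
    return digits, carries_in


print(add_with_carries(123, 456))  # ([9, 7, 5, 0], [0, 0, 0, 0])
print(add_with_carries(999, 1))    # ([0, 0, 0, 1], [0, 1, 1, 1])
```

Position 0 never receives a carry, while every later position needs the carry out of the column before it.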

Why the Reversed Output Was Necessary

A causal Transformer predicts token $t$ using only tokens $0, \ldots, t-1$. In non-reversed (most-significant-digit-first) addition, the leftmost output digit depends on carries generated by the rightmost input columns, so the full carry chain has to be resolved before the first digit is emitted. Worse: by the time the model is asked to emit the tens digit, it has already emitted the thousands digit, yet the thousands digit logically depends on the tens carry. The causal mask forbids the model from looking ahead at digits it has not yet emitted, so earlier output tokens cannot help resolve this dependency.

Reversing the output makes every output digit depend only on prompt tokens the model can already attend to and on carries implied by digits it has already emitted. The dependency graph is causal, and a causal Transformer can model it.
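
To illustrate why the reversed order is causal, the sketch below generates the answer one digit at a time using only the prompt and the single digit emitted just before. It relies on the arithmetic fact that the carry out of a column is 1 exactly when the emitted digit is smaller than the sum of that column's two input digits. The function name is hypothetical, and this is a hand-written reference procedure, not a claim about the circuit the model learns.

```python
def decode_reversed(prompt: str) -> str:
    """Emit the 4-digit reversed sum for a prompt of the form 'ABC+DEF='."""
    # Input digits, units first, padded with 0 for the thousands column.
    a_digits = [int(ch) for ch in reversed(prompt[:3])] + [0]
    b_digits = [int(ch) for ch in reversed(prompt[4:7])] + [0]
    emitted: list[int] = []
    for i in range(4):
        if i == 0:
            carry = 0  # the units column has no carry-in
        else:
            # Carry out of column i-1, recovered from the digit already emitted:
            # it is 1 exactly when that digit is smaller than the column's input sum.
            prev_sum = a_digits[i - 1] + b_digits[i - 1]
            carry = 1 if emitted[i - 1] < prev_sum else 0
        emitted.append((a_digits[i] + b_digits[i] + carry) % 10)
    return "".join(str(d) for d in emitted)


print(decode_reversed("123+456="))  # 9750
print(decode_reversed("999+001="))  # 0001
```

Each step looks at two columns of the prompt and one previously emitted token. In the non-reversed order no such local rule exists for the first digit, because its value depends on carries from all three columns to its right.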

Attention Inspection

The attention weights from the final layer at the moment the model emits position 0 of the answer (units digit of the sum) are highest on the units of $A$ and $B$ in the prompt, consistent with column-wise addition. The model learned that columns exist and that the rightmost column produces the rightmost output, without any inductive bias hard-coding this.
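
For readers who want to reproduce this kind of inspection, the sketch below shows the computation behind one row of attention weights and how to rank prompt positions by weight. It uses plain Python and made-up 2-dimensional query and key vectors chosen so that the units positions win; with a real model the vectors would come from the final layer's heads. All names here are illustrative.

```python
import math


def causal_attention_row(query: list[float], keys: list[list[float]]) -> list[float]:
    """Softmax attention weights of one query over the keys it is allowed to see.

    The caller passes only keys at positions up to and including the query's own.
    """
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_attended(weights: list[float], tokens: list[str], k: int = 2):
    """Return the k (position, token, weight) triples with the largest weight."""
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return [(i, tokens[i], weights[i]) for i in ranked[:k]]


tokens = list("123+456=")
# Made-up keys: only the units of A (index 2) and of B (index 6) align with the query.
keys = [[3.0, 0.0] if i in (2, 6) else [0.0, 0.0] for i in range(8)]
query = [3.0, 0.0]  # stands in for the query at "=", where answer position 0 is predicted

weights = causal_attention_row(query, keys)
for position, token, weight in top_attended(weights, tokens):
    print(position, token, round(weight, 3))
```

The query sits at the "=" token because that is the position from which the first answer digit is predicted.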

What This Proves

A working decoder-only Transformer fits in $\sim$150 lines of PyTorch and trains to 100% on three-digit addition in 98 seconds.
The per-position learning curriculum reveals interpretable internal structure: marginals are learned before functions; dependency chains are mastered in dependency order.
How a task is framed determines whether a given architecture can learn it. Reversed-output is the difference between 100% and ~25% here — the same difference often appears in real LLM tasks, where prompt engineering and supervised fine-tuning are partial reframings of the dependency structure.

Full code on GitHub: github.com/soveshmohapatra/TinyGPT
