We evaluate the GCN and GAT architectures from Parts 1--2 on the classic Zachary's Karate Club benchmark. Using only 4 labeled nodes out of 34 (semi-supervised), both models achieve 100% node classification accuracy. We analyze training dynamics, convergence behavior, and what the graph visualizations reveal about how message passing propagates labels through the network.
## The Benchmark: Zachary's Karate Club
Wayne Zachary's 1977 study documented the social network of a university karate club that split into two factions after a dispute between the instructor (Mr. Hi, node 0) and the club officer (node 33).
| Property | Value |
|---|---|
| Nodes | 34 |
| Undirected edges | 78 |
| Classes | 2 (Mr. Hi vs. Officer) |
| Node features | One-hot identity (34-dim) |
| Class 0 (Mr. Hi) | 16 nodes |
| Class 1 (Officer) | 18 nodes |
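As a quick sanity check, the graph ships with `networkx`, so the table's statistics and the one-hot feature matrix can be reproduced in a few lines (a sketch; variable names here are our own, not the code from Parts 1--2):

```python
import networkx as nx
import numpy as np

# Zachary's Karate Club is bundled with networkx.
G = nx.karate_club_graph()

n = G.number_of_nodes()  # 34 members
m = G.number_of_edges()  # 78 undirected friendships

# One-hot identity features: each node is its own 34-dim indicator vector.
X = np.eye(n)

print(n, m, X.shape)
```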
## Semi-Supervised Setup
Following the standard semi-supervised protocol:
- Training: 2 randomly selected nodes per class (4 total): nodes 3, 6, 18, 20
- Evaluation: All 34 nodes
- Task: Predict which faction each member joined
The challenge: can 4 labels, propagated through the graph structure, correctly classify all 34 nodes?
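Concretely, the split amounts to a boolean mask: the loss is computed only on the 4 labeled nodes, while predictions are made for all 34. A minimal sketch (the mask-based setup is an assumption about the implementation, though it is the standard pattern):

```python
import numpy as np

NUM_NODES = 34
TRAIN_NODES = [3, 6, 18, 20]  # 2 per class, as listed above

# Loss is evaluated only where train_mask is True;
# accuracy is evaluated over all 34 nodes.
train_mask = np.zeros(NUM_NODES, dtype=bool)
train_mask[TRAIN_NODES] = True

print(train_mask.sum(), "labeled /", NUM_NODES, "total")
```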
## Experimental Configuration
| Hyperparameter | GCN | GAT |
|---|---|---|
| Architecture | 34-16-2 | 34-[4x8=32]-2 |
| Activation | ReLU | ELU |
| Dropout | 0.5 | 0.6 |
| Learning rate | 0.01 | 0.005 |
| Epochs | 200 | 200 |
| Attention heads | -- | 4 (layer 1), 1 (output) |
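The GCN column of the table translates into a very small model. Below is a sketch of a 2-layer GCN matching those hyperparameters (34 → 16 → 2, ReLU, dropout 0.5, Adam at lr 0.01); the class and variable names are illustrative, not the exact code from Parts 1--2, and `A_hat` stands for the precomputed normalized adjacency:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCN(nn.Module):
    """2-layer GCN per the table: 34 -> 16 -> 2, ReLU, dropout 0.5."""
    def __init__(self, in_dim=34, hidden=16, out_dim=2, p_drop=0.5):
        super().__init__()
        self.W1 = nn.Linear(in_dim, hidden, bias=False)
        self.W2 = nn.Linear(hidden, out_dim, bias=False)
        self.drop = nn.Dropout(p_drop)

    def forward(self, X, A_hat):
        # A_hat: normalized adjacency with self-loops (dense, 34x34).
        H = F.relu(A_hat @ self.W1(self.drop(X)))
        return A_hat @ self.W2(self.drop(H))

model = GCN()
opt = torch.optim.Adam(model.parameters(), lr=0.01)  # lr from the table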
## Results
### Accuracy
Both models achieve perfect separation of the two factions using only 4 labeled training nodes. GCN: 100.00%. GAT: 100.00%. Zero misclassified nodes.
### Training Dynamics
The convergence behavior reveals important differences:
GCN converges faster. At epoch 1, GCN already classifies 79.4% of nodes correctly. The symmetric normalization provides an immediate structural prior: nodes that share many neighbors with a labeled node receive a strong signal from the very first forward pass. By epoch 50, both loss and accuracy have essentially plateaued.
GAT starts slower but has lower initial loss. GAT's accuracy at epoch 1 is only 64.7%, but its loss (0.648) is lower than GCN's (0.700). This apparent contradiction arises because GAT's attention weights are initially near-uniform, producing well-calibrated but indecisive predictions. As attention sharpens, accuracy jumps rapidly.
GAT's higher final loss. GAT's loss at epoch 200 (0.061) is 20$\times$ higher than GCN's (0.003). This is not a sign of worse performance -- both achieve 100% accuracy. The difference stems from GAT's stronger dropout (0.6 vs. 0.5), including dropout on the attention coefficients, whose stochasticity keeps the training-mode loss from collapsing to near zero.
## Why Does Semi-Supervised Learning Work So Well?
### Label Propagation via Message Passing
A 2-layer GNN gives each node access to its 2-hop neighborhood. In the Karate Club graph, the average shortest path length is approximately 2.4, meaning most nodes are within 2--3 hops of a labeled node.
The mechanism:
- Layer 1: Each labeled node's class signal spreads to its immediate neighbors.
- Layer 2: Those neighbors propagate the signal one hop further, reaching 2-hop neighborhoods.
With 4 strategically placed labels (nodes 3, 6, 18, 20), the 2-hop coverage encompasses nearly the entire graph.
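The coverage claim is easy to check with a breadth-first search from the labeled nodes (a small sketch using `networkx`; the `coverage` helper is ours, not from the post's code):

```python
import networkx as nx

G = nx.karate_club_graph()
LABELED = [3, 6, 18, 20]

def coverage(G, seeds, hops):
    """Fraction of nodes within `hops` edges of at least one seed node."""
    reached = set()
    for s in seeds:
        reached |= set(nx.single_source_shortest_path_length(G, s, cutoff=hops))
    return len(reached) / G.number_of_nodes()

print("2-hop coverage:", coverage(G, LABELED, 2))  # nearly the whole graph
print("3-hop coverage:", coverage(G, LABELED, 3))
```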
### Graph Structure as Implicit Regularization
Even without meaningful node features (we use one-hot identities), the adjacency matrix encodes community structure. The normalized propagation $\hat{\mathbf{D}}^{-1/2}\hat{\mathbf{A}}\hat{\mathbf{D}}^{-1/2}$ acts as a smoothing operator: connected nodes are pushed toward similar representations. Since the two factions are densely connected within groups and sparsely connected between groups, this smoothing naturally separates them.
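The smoothing claim can be made concrete by building the operator and inspecting its spectrum. The matrix below follows the formula above; the construction is a sketch in NumPy, not the exact code from Parts 1--2:

```python
import networkx as nx
import numpy as np

G = nx.karate_club_graph()
A = nx.to_numpy_array(G)            # binary adjacency
A_hat = A + np.eye(len(A))          # add self-loops: A-hat = A + I
d = A_hat.sum(axis=1)
D_inv_sqrt = np.diag(d ** -0.5)
P = D_inv_sqrt @ A_hat @ D_inv_sqrt  # D-hat^{-1/2} A-hat D-hat^{-1/2}

# P is symmetric with eigenvalues in (-1, 1]; the top eigenvalue is
# exactly 1, so repeated application damps every component except the
# smoothest one -- this is the smoothing behavior described above.
eigvals = np.linalg.eigvalsh(P)
print(eigvals.min(), eigvals.max())
```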
### The Homophily Assumption
Both GCN and GAT implicitly assume homophily -- that connected nodes tend to share labels. Zachary's Karate Club exhibits strong homophily (members befriend those in their faction), making it an ideal benchmark. On heterophilous graphs (where connected nodes tend to have different labels), standard GNNs struggle and require architectural modifications.
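The strength of homophily can be quantified as the fraction of edges whose endpoints share a faction (often called edge homophily). A sketch using the faction labels that `networkx` stores in the `club` node attribute:

```python
import networkx as nx

G = nx.karate_club_graph()
# networkx records each member's eventual faction as a 'club' attribute.
club = nx.get_node_attributes(G, "club")

same = sum(club[u] == club[v] for u, v in G.edges())
homophily = same / G.number_of_edges()
print(f"edge homophily: {homophily:.2f}")  # well above 0.5 -> homophilous
```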
## Limitations and Extensions
- Small scale: 34 nodes is trivial by modern standards. Real GNN benchmarks use Cora (2,708 nodes), CiteSeer (3,327), or OGB datasets (millions of nodes).
- No edge features: Our implementation handles binary adjacency. Real-world graphs often have typed or weighted edges.
- Transductive only: Both models require the full adjacency at test time. Inductive variants (GraphSAGE) can generalize to unseen nodes.
- Over-smoothing: Stacking too many GNN layers causes all node representations to converge, destroying discriminative information. Techniques like residual connections, jumping knowledge, and normalization help.
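The over-smoothing point can be demonstrated numerically: repeatedly applying the normalized propagation operator drives every node's representation toward the same direction (proportional to $\sqrt{\deg + 1}$, the operator's dominant eigenvector). A sketch, using node 0's one-hot feature as the starting signal:

```python
import networkx as nx
import numpy as np

G = nx.karate_club_graph()
n = G.number_of_nodes()
A_hat = nx.to_numpy_array(G) + np.eye(n)
d = A_hat.sum(axis=1)
P = np.diag(d ** -0.5) @ A_hat @ np.diag(d ** -0.5)

# Dominant eigenvector of P is proportional to sqrt(degree + 1).
v = np.sqrt(d)
v /= np.linalg.norm(v)

def alignment(depth, x):
    """Cosine alignment of P^depth @ x with the dominant eigenvector."""
    h = np.linalg.matrix_power(P, depth) @ x
    return abs(h @ v) / np.linalg.norm(h)

x = np.eye(n)[0]  # one-hot feature of node 0
for depth in (1, 10, 100):
    # Alignment grows with depth: deep stacks lose node-specific signal.
    print(depth, alignment(depth, x))
```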
## Conclusion
Building GCN and GAT from scratch in pure PyTorch reveals that graph neural networks are, at their core, remarkably simple:
- Normalize the adjacency matrix (GCN) or compute attention (GAT)
- Multiply by projected features
- Repeat for $L$ layers
On Zachary's Karate Club, this procedure achieves perfect accuracy with only 4 labeled nodes -- a testament to the power of exploiting graph structure in semi-supervised learning.
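The three-step recipe fits in a few lines of NumPy. This is an untrained sketch of the GCN forward pass only (random weights, our own variable names), just to show how little machinery is involved:

```python
import networkx as nx
import numpy as np

rng = np.random.default_rng(42)
G = nx.karate_club_graph()
n = G.number_of_nodes()

# Step 1: normalize the adjacency (the GCN variant of the recipe).
A_hat = nx.to_numpy_array(G) + np.eye(n)
d_inv_sqrt = np.diag(A_hat.sum(axis=1) ** -0.5)
P = d_inv_sqrt @ A_hat @ d_inv_sqrt

# Steps 2-3: multiply by projected features, repeat for L = 2 layers.
X = np.eye(n)                          # one-hot identity features
W1 = rng.standard_normal((n, 16)) * 0.1  # random, untrained weights
W2 = rng.standard_normal((16, 2)) * 0.1

H = np.maximum(P @ X @ W1, 0)          # layer 1 + ReLU
logits = P @ H @ W2                    # layer 2: one score per class per node

print(logits.shape)  # (34, 2)
```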
The key takeaways:
- Graph structure alone carries significant information about node labels in homophilous networks
- GCN's fixed normalization converges faster; GAT's learned attention provides flexibility for heterogeneous graphs
- Semi-supervised learning on graphs is remarkably data-efficient: labeling less than 12% of nodes sufficed for perfect classification