Graph Neural Networks extend deep learning beyond grids and sequences to arbitrary relational structures. This first installment develops the mathematical foundations: how graphs are represented for computation, why spectral convolutions motivate the GCN layer, and how attention mechanisms generalize fixed aggregation. Every equation maps directly to the pure-PyTorch implementation in Part 2.
Why Graphs?
Convolutional Neural Networks assume data lives on a regular grid (images). Recurrent networks assume a sequential ordering (text, audio). But many of the most interesting datasets are graphs:
- Social networks -- nodes are people, edges are friendships
- Molecules -- nodes are atoms, edges are chemical bonds
- Citation networks -- nodes are papers, edges are references
- Knowledge graphs -- nodes are entities, edges are relations
A graph $G = (V, E)$ with $|V| = N$ nodes is compactly described by:
- An adjacency matrix $\mathbf{A} \in \{0,1\}^{N \times N}$ where $A_{ij} = 1$ if edge $(i,j) \in E$.
- A feature matrix $\mathbf{X} \in \mathbb{R}^{N \times F}$ where row $i$ holds the $F$-dimensional feature vector of node $i$.
The goal of a GNN is to learn a function $f: (\mathbf{X}, \mathbf{A}) \mapsto \mathbf{Z} \in \mathbb{R}^{N \times C}$ that produces useful node representations (e.g., for classification into $C$ classes) by exploiting both features and graph structure.
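As a concrete preview of the tensors Part 2 builds, here is a minimal PyTorch sketch using a hypothetical 4-node toy graph with random features (names and values are illustrative only):

```python
import torch

# Hypothetical 4-node toy graph: undirected edges (0,1), (1,2), (2,3), (3,0)
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
N, F = 4, 3  # number of nodes, feature dimension

# Adjacency matrix A: A[i, j] = 1 if (i, j) is an edge (symmetric for an undirected graph)
A = torch.zeros(N, N)
for i, j in edges:
    A[i, j] = 1.0
    A[j, i] = 1.0

# Feature matrix X: row i holds node i's F-dimensional feature vector
X = torch.randn(N, F)

print(A.shape, X.shape)  # torch.Size([4, 4]) torch.Size([4, 3])
```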
Graph Representation for Computation
Adjacency Matrix and Self-Loops
The raw adjacency matrix $\mathbf{A}$ encodes which nodes are connected, but aggregating with $\mathbf{A}$ alone would leave out each node's own features. A standard trick is to add self-loops:

$$\hat{\mathbf{A}} = \mathbf{A} + \mathbf{I}_N$$

so that every node aggregates its own representation alongside its neighbors'.
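In code the self-loop trick is a single line; a minimal sketch reusing the toy adjacency matrix from above:

```python
import torch

# A: the toy adjacency matrix from the previous sketch
A = torch.tensor([[0., 1., 0., 1.],
                  [1., 0., 1., 0.],
                  [0., 1., 0., 1.],
                  [1., 0., 1., 0.]])
A_hat = A + torch.eye(A.shape[0])  # self-loops: every node becomes its own neighbor
```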
Degree Matrix and Normalization
The degree matrix of $\hat{\mathbf{A}}$ is the diagonal matrix of its row sums:

$$\hat{\mathbf{D}}_{ii} = \sum_{j} \hat{A}_{ij}$$
Naively multiplying $\hat{\mathbf{A}}\mathbf{X}$ sums neighbor features, but high-degree nodes accumulate disproportionately large values. Two normalizations are common:
- Row normalization: $\hat{\mathbf{D}}^{-1}\hat{\mathbf{A}}$ -- each row sums to 1 (mean aggregation).
- Symmetric normalization: $\hat{\mathbf{D}}^{-1/2}\hat{\mathbf{A}}\hat{\mathbf{D}}^{-1/2}$ -- the normalization used by Kipf & Welling, which has a spectral justification.
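A short sketch of both normalizations on the toy graph, assuming the `A_hat` with self-loops from the previous sketch:

```python
import torch

# A_hat: toy adjacency with self-loops, as built above
A_hat = torch.tensor([[1., 1., 0., 1.],
                      [1., 1., 1., 0.],
                      [0., 1., 1., 1.],
                      [1., 0., 1., 1.]])

# Degree vector of A_hat (the diagonal of D_hat)
deg = A_hat.sum(dim=1)

# Row normalization: D_hat^{-1} A_hat -- each row sums to 1 (mean aggregation)
A_row = A_hat / deg.unsqueeze(1)

# Symmetric normalization: D_hat^{-1/2} A_hat D_hat^{-1/2} (Kipf & Welling)
d_inv_sqrt = deg.pow(-0.5)
A_sym = d_inv_sqrt.unsqueeze(1) * A_hat * d_inv_sqrt.unsqueeze(0)
```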
Spectral Graph Convolution
On regular grids, convolution is multiplication in the Fourier domain. The graph analog uses the symmetrically normalized graph Laplacian:

$$\tilde{\mathbf{L}} = \mathbf{I}_N - \mathbf{D}^{-1/2}\mathbf{A}\mathbf{D}^{-1/2}$$
The eigendecomposition $\tilde{\mathbf{L}} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^T$ defines the graph Fourier transform: $\mathbf{U}^T\mathbf{x}$ carries a node signal $\mathbf{x}$ into the spectral domain. A spectral convolution with filter $g_\theta$ is then:

$$g_\theta \star \mathbf{x} = \mathbf{U}\, g_\theta(\boldsymbol{\Lambda})\, \mathbf{U}^T \mathbf{x}$$
Computing the full eigendecomposition is $O(N^3)$. Chebyshev polynomials approximate $g_\theta(\boldsymbol{\Lambda})$ without explicit eigenvectors. Kipf & Welling's key simplification: truncate the Chebyshev expansion to first order, yielding the now-famous GCN propagation rule.
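To make the cost argument concrete, here is a hedged sketch of a full spectral filter on the toy graph, with an explicit eigendecomposition and a random filter standing in for learned parameters; this is exactly the step the GCN layer below avoids:

```python
import torch

# Toy normalized Laplacian L = I - D^{-1/2} A D^{-1/2}
A = torch.tensor([[0., 1., 0., 1.],
                  [1., 0., 1., 0.],
                  [0., 1., 0., 1.],
                  [1., 0., 1., 0.]])
d_inv_sqrt = A.sum(dim=1).pow(-0.5)
L = torch.eye(4) - d_inv_sqrt.unsqueeze(1) * A * d_inv_sqrt.unsqueeze(0)

# Graph Fourier basis via eigendecomposition -- the O(N^3) step
eigvals, U = torch.linalg.eigh(L)

# A spectral filter assigns one coefficient per eigenvalue (learnable in practice, random here)
g_theta = torch.rand(4)

# Filter a scalar node signal x: U g_theta(Lambda) U^T x
x = torch.randn(4)
x_filtered = U @ (g_theta * (U.T @ x))
```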
The GCN Layer (Kipf & Welling, 2017)
Starting from the first-order Chebyshev approximation and applying a renormalization trick, the Graph Convolutional layer computes:

$$\mathbf{H}^{(\ell+1)} = \sigma\left(\hat{\mathbf{D}}^{-1/2}\hat{\mathbf{A}}\hat{\mathbf{D}}^{-1/2}\,\mathbf{H}^{(\ell)}\mathbf{W}^{(\ell)}\right)$$
where:
- $\mathbf{H}^{(0)} = \mathbf{X}$ (input features)
- $\mathbf{W}^{(\ell)} \in \mathbb{R}^{F_\ell \times F_{\ell+1}}$ is a learnable weight matrix
- $\sigma$ is a nonlinearity (typically ReLU)
This equation can be read as three steps:
- Transform: Project features via $\mathbf{W}$
- Aggregate: Each node averages its neighbors' (transformed) features, weighted by degree normalization
- Activate: Apply nonlinearity
Stacking $L$ layers gives each node a receptive field of $L$-hop neighbors. A 2-layer GCN can capture interactions between nodes up to 2 hops apart -- sufficient for many community detection tasks.
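Here is a minimal pure-PyTorch sketch of this rule, previewing Part 2; the class and variable names are illustrative, not the final implementation. It makes the three steps explicit and stacks two layers:

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One GCN layer: H_out = ReLU(A_norm @ H @ W)."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.W = nn.Linear(in_features, out_features, bias=False)

    def forward(self, H, A_norm):
        H = self.W(H)          # Transform: project features with W
        H = A_norm @ H         # Aggregate: degree-normalized neighbor sum
        return torch.relu(H)   # Activate: nonlinearity (a real output layer would skip this)

# Toy graph: symmetrically normalized adjacency with self-loops, random features
A_hat = torch.tensor([[1., 1., 0., 1.],
                      [1., 1., 1., 0.],
                      [0., 1., 1., 1.],
                      [1., 0., 1., 1.]])
d_inv_sqrt = A_hat.sum(dim=1).pow(-0.5)
A_norm = d_inv_sqrt.unsqueeze(1) * A_hat * d_inv_sqrt.unsqueeze(0)
X = torch.randn(4, 3)

# Two stacked layers -> each node sees its 2-hop neighborhood
Z = GCNLayer(8, 2)(GCNLayer(3, 8)(X, A_norm), A_norm)
print(Z.shape)  # torch.Size([4, 2])
```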
Attention on Graphs: The GAT Layer
Motivation
GCN assigns fixed weights to neighbors based on degree. But not all neighbors are equally informative. Graph Attention Networks (Velickovic et al., 2018) learn adaptive weights.
Attention Mechanism
For each edge $(i, j)$, GAT scores how much node $i$ should attend to neighbor $j$, normalizes the scores over $i$'s neighborhood, and aggregates with those weights:

$$e_{ij} = \mathrm{LeakyReLU}\left(\mathbf{a}^T \left[\mathbf{W}\mathbf{h}_i \,\|\, \mathbf{W}\mathbf{h}_j\right]\right), \qquad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}(i)} \exp(e_{ik})}, \qquad \mathbf{h}_i' = \sigma\left(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}\, \mathbf{W}\mathbf{h}_j\right)$$

where $\mathbf{W} \in \mathbb{R}^{F' \times F}$ is a shared linear projection and $\mathbf{a} \in \mathbb{R}^{2F'}$ is a learnable attention vector.
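A minimal single-head sketch in dense PyTorch, using the toy graph from earlier and illustrative tensor names (Part 2 develops the full layer):

```python
import torch
import torch.nn.functional as F

N, F_in, F_out = 4, 3, 8
A_hat = torch.tensor([[1., 1., 0., 1.],     # adjacency with self-loops,
                      [1., 1., 1., 0.],     # so a node also attends to itself
                      [0., 1., 1., 1.],
                      [1., 0., 1., 1.]])
H = torch.randn(N, F_in)

W = torch.randn(F_out, F_in)     # shared projection (learnable in practice)
a = torch.randn(2 * F_out)       # attention vector (learnable in practice)

Wh = H @ W.T                     # (N, F_out): W h_i for every node
# e_ij = LeakyReLU(a^T [W h_i || W h_j]), computed for all pairs at once
left = Wh @ a[:F_out]            # a_1^T W h_i, shape (N,)
right = Wh @ a[F_out:]           # a_2^T W h_j, shape (N,)
e = F.leaky_relu(left.unsqueeze(1) + right.unsqueeze(0), negative_slope=0.2)

# Softmax only over each node's neighborhood (non-edges masked to -inf)
e = e.masked_fill(A_hat == 0, float('-inf'))
alpha = torch.softmax(e, dim=1)  # alpha[i, j]: weight of neighbor j for node i

H_out = F.elu(alpha @ Wh)        # attention-weighted aggregation + nonlinearity
```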
Multi-Head Attention
To stabilize training, GAT runs $K$ independent attention heads in parallel and concatenates their outputs:

$$\mathbf{h}_i' = \big\Vert_{k=1}^{K} \sigma\left(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{(k)}\, \mathbf{W}^{(k)} \mathbf{h}_j\right)$$
where $\|$ denotes concatenation. For the output layer, heads are averaged instead of concatenated to produce $C$-dimensional logits.
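A small sketch of the two head-combination strategies, with hypothetical per-head outputs standing in for real attention results:

```python
import torch

K, N, F_out = 4, 34, 8
# Hypothetical per-head outputs, each of shape (N, F_out), e.g. from K independent
# copies of the single-head sketch above
head_outputs = [torch.randn(N, F_out) for _ in range(K)]

# Hidden layers: concatenate heads along the feature dimension -> (N, K * F_out)
H_concat = torch.cat(head_outputs, dim=1)

# Output layer: average heads instead (with F_out = C) -> (N, C) logits
H_avg = torch.stack(head_outputs, dim=0).mean(dim=0)
print(H_concat.shape, H_avg.shape)  # torch.Size([34, 32]) torch.Size([34, 8])
```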
GCN vs. GAT: When to Use Which
| Property | GCN | GAT |
|---|---|---|
| Neighbor weighting | Fixed (degree-based) | Learned (attention) |
| Parameters per layer | $O(F \cdot F')$ | $O(K(F \cdot F' + 2F'))$ |
| Computational cost | Lower | Higher (attention scores) |
| Best for | Homogeneous graphs | Heterogeneous importance |
Looking Ahead
In Part 2, we implement both layers in pure PyTorch -- no graph libraries. We will see how the normalized adjacency matrix and attention scores translate to ~150 lines of actual code.
In Part 3, we train both models on Zachary's Karate Club (34 nodes, 78 edges) with only 4 labeled nodes and compare convergence, accuracy, and the learned representations.