Graph Neural Networks extend deep learning beyond grids and sequences to arbitrary relational structures. This first installment develops the mathematical foundations: how graphs are represented for computation, why spectral convolutions motivate the GCN layer, and how attention mechanisms generalize fixed aggregation. Every equation maps directly to the pure-PyTorch implementation in Part 2.
Why Graphs?
Convolutional Neural Networks assume data lives on a regular grid (images). Recurrent networks assume a sequential ordering (text, audio). But many of the most interesting datasets are graphs:
- Social networks -- nodes are people, edges are friendships
- Molecules -- nodes are atoms, edges are chemical bonds
- Citation networks -- nodes are papers, edges are references
- Knowledge graphs -- nodes are entities, edges are relations
A graph $G = (V, E)$ with $|V| = N$ nodes is compactly described by:
- An adjacency matrix $\mathbf{A} \in \{0,1\}^{N \times N}$ where $A_{ij} = 1$ if edge $(i,j) \in E$.
- A feature matrix $\mathbf{X} \in \mathbb{R}^{N \times F}$ where row $i$ holds the $F$-dimensional feature vector of node $i$.
The goal of a GNN is to learn a function $f: (\mathbf{X}, \mathbf{A}) \mapsto \mathbf{Z} \in \mathbb{R}^{N \times C}$ that produces useful node representations (e.g., for classification into $C$ classes) by exploiting both features and graph structure.
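As a concrete preview of the tensors Part 2 builds, here is a minimal PyTorch sketch using a hypothetical 4-node toy graph with random features (names and values are illustrative only):

```python
import torch

# Hypothetical 4-node toy graph: undirected edges (0,1), (1,2), (2,3), (3,0)
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
N, F = 4, 3  # number of nodes, feature dimension

# Adjacency matrix A: A[i, j] = 1 if (i, j) is an edge (symmetric for an undirected graph)
A = torch.zeros(N, N)
for i, j in edges:
    A[i, j] = 1.0
    A[j, i] = 1.0

# Feature matrix X: row i holds node i's F-dimensional feature vector
X = torch.randn(N, F)

print(A.shape, X.shape)  # torch.Size([4, 4]) torch.Size([4, 3])
```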
Graph Representation for Computation
Adjacency Matrix and Self-Loops
The raw adjacency matrix $\mathbf{A}$ encodes which nodes are connected, but aggregating with $\mathbf{A}$ alone would leave out each node's own features. A standard trick is to add self-loops:

$$\hat{\mathbf{A}} = \mathbf{A} + \mathbf{I}_N$$

so that every node aggregates its own representation alongside its neighbors'.
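In code the self-loop trick is a single line; a minimal sketch reusing the toy adjacency matrix from above:

```python
import torch

# A: the toy adjacency matrix from the previous sketch
A = torch.tensor([[0., 1., 0., 1.],
                  [1., 0., 1., 0.],
                  [0., 1., 0., 1.],
                  [1., 0., 1., 0.]])
A_hat = A + torch.eye(A.shape[0])  # self-loops: every node becomes its own neighbor
```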
Degree Matrix and Normalization
The degree matrix of $\hat{\mathbf{A}}$ is the diagonal matrix of its row sums:

$$\hat{\mathbf{D}}_{ii} = \sum_{j} \hat{A}_{ij}$$
Naively multiplying $\hat{\mathbf{A}}\mathbf{X}$ sums neighbor features, but high-degree nodes accumulate disproportionately large values. Two normalizations are common:
- Row normalization: $\hat{\mathbf{D}}^{-1}\hat{\mathbf{A}}$ -- each row sums to 1 (mean aggregation).
- Symmetric normalization: $\hat{\mathbf{D}}^{-1/2}\hat{\mathbf{A}}\hat{\mathbf{D}}^{-1/2}$ -- the normalization used by Kipf & Welling, which has a spectral justification.
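A short sketch of both normalizations on the toy graph, assuming the `A_hat` with self-loops from the previous sketch:

```python
import torch

# A_hat: toy adjacency with self-loops, as built above
A_hat = torch.tensor([[1., 1., 0., 1.],
                      [1., 1., 1., 0.],
                      [0., 1., 1., 1.],
                      [1., 0., 1., 1.]])

# Degree vector of A_hat (the diagonal of D_hat)
deg = A_hat.sum(dim=1)

# Row normalization: D_hat^{-1} A_hat -- each row sums to 1 (mean aggregation)
A_row = A_hat / deg.unsqueeze(1)

# Symmetric normalization: D_hat^{-1/2} A_hat D_hat^{-1/2} (Kipf & Welling)
d_inv_sqrt = deg.pow(-0.5)
A_sym = d_inv_sqrt.unsqueeze(1) * A_hat * d_inv_sqrt.unsqueeze(0)
```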
Spectral Graph Convolution
On regular grids, convolution is multiplication in the Fourier domain. The graph analog uses the symmetrically normalized graph Laplacian:

$$\tilde{\mathbf{L}} = \mathbf{I}_N - \mathbf{D}^{-1/2}\mathbf{A}\mathbf{D}^{-1/2}$$
The eigendecomposition $\tilde{\mathbf{L}} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^T$ defines the graph Fourier transform: $\mathbf{U}^T\mathbf{x}$ carries a node signal $\mathbf{x}$ into the spectral domain. A spectral convolution with filter $g_\theta$ is then:

$$g_\theta \star \mathbf{x} = \mathbf{U}\, g_\theta(\boldsymbol{\Lambda})\, \mathbf{U}^T \mathbf{x}$$
Computing the full eigendecomposition is $O(N^3)$. Chebyshev polynomials approximate $g_\theta(\boldsymbol{\Lambda})$ without explicit eigenvectors. Kipf & Welling's key simplification: truncate the Chebyshev expansion to first order, yielding the now-famous GCN propagation rule.
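To make the cost argument concrete, here is a hedged sketch of a full spectral filter on the toy graph, with an explicit eigendecomposition and a random filter standing in for learned parameters; this is exactly the step the GCN layer below avoids:

```python
import torch

# Toy normalized Laplacian L = I - D^{-1/2} A D^{-1/2}
A = torch.tensor([[0., 1., 0., 1.],
                  [1., 0., 1., 0.],
                  [0., 1., 0., 1.],
                  [1., 0., 1., 0.]])
d_inv_sqrt = A.sum(dim=1).pow(-0.5)
L = torch.eye(4) - d_inv_sqrt.unsqueeze(1) * A * d_inv_sqrt.unsqueeze(0)

# Graph Fourier basis via eigendecomposition -- the O(N^3) step
eigvals, U = torch.linalg.eigh(L)

# A spectral filter assigns one coefficient per eigenvalue (learnable in practice, random here)
g_theta = torch.rand(4)

# Filter a scalar node signal x: U g_theta(Lambda) U^T x
x = torch.randn(4)
x_filtered = U @ (g_theta * (U.T @ x))
```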
The GCN Layer (Kipf & Welling, 2017)
Starting from the first-order Chebyshev approximation and applying a renormalization trick, the Graph Convolutional layer computes:

$$\mathbf{H}^{(\ell+1)} = \sigma\left(\hat{\mathbf{D}}^{-1/2}\hat{\mathbf{A}}\hat{\mathbf{D}}^{-1/2}\,\mathbf{H}^{(\ell)}\mathbf{W}^{(\ell)}\right)$$
where:
- $\mathbf{H}^{(0)} = \mathbf{X}$ (input features)
- $\mathbf{W}^{(\ell)} \in \mathbb{R}^{F_\ell \times F_{\ell+1}}$ is a learnable weight matrix
- $\sigma$ is a nonlinearity (typically ReLU)
This equation can be read as three steps:
- Transform: Project features via $\mathbf{W}$
- Aggregate: Each node averages its neighbors' (transformed) features, weighted by degree normalization
- Activate: Apply nonlinearity
Stacking $L$ layers gives each node a receptive field of $L$-hop neighbors. A 2-layer GCN can capture interactions between nodes up to 2 hops apart -- sufficient for many community detection tasks.
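Here is a minimal pure-PyTorch sketch of this rule, previewing Part 2; the class and variable names are illustrative, not the final implementation. It makes the three steps explicit and stacks two layers:

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One GCN layer: H_out = ReLU(A_norm @ H @ W)."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.W = nn.Linear(in_features, out_features, bias=False)

    def forward(self, H, A_norm):
        H = self.W(H)          # Transform: project features with W
        H = A_norm @ H         # Aggregate: degree-normalized neighbor sum
        return torch.relu(H)   # Activate: nonlinearity (a real output layer would skip this)

# Toy graph: symmetrically normalized adjacency with self-loops, random features
A_hat = torch.tensor([[1., 1., 0., 1.],
                      [1., 1., 1., 0.],
                      [0., 1., 1., 1.],
                      [1., 0., 1., 1.]])
d_inv_sqrt = A_hat.sum(dim=1).pow(-0.5)
A_norm = d_inv_sqrt.unsqueeze(1) * A_hat * d_inv_sqrt.unsqueeze(0)
X = torch.randn(4, 3)

# Two stacked layers -> each node sees its 2-hop neighborhood
Z = GCNLayer(8, 2)(GCNLayer(3, 8)(X, A_norm), A_norm)
print(Z.shape)  # torch.Size([4, 2])
```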
Attention on Graphs: The GAT Layer
Motivation
GCN assigns fixed weights to neighbors based on degree. But not all neighbors are equally informative. Graph Attention Networks (Velickovic et al., 2018) learn adaptive weights.
Attention Mechanism
For each edge $(i, j)$, GAT scores how much node $i$ should attend to neighbor $j$, normalizes the scores over $i$'s neighborhood, and aggregates with those weights:

$$e_{ij} = \mathrm{LeakyReLU}\left(\mathbf{a}^T \left[\mathbf{W}\mathbf{h}_i \,\|\, \mathbf{W}\mathbf{h}_j\right]\right), \qquad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}(i)} \exp(e_{ik})}, \qquad \mathbf{h}_i' = \sigma\left(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}\, \mathbf{W}\mathbf{h}_j\right)$$

where $\mathbf{W} \in \mathbb{R}^{F' \times F}$ is a shared linear projection and $\mathbf{a} \in \mathbb{R}^{2F'}$ is a learnable attention vector.
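A minimal single-head sketch in dense PyTorch, using the toy graph from earlier and illustrative tensor names (Part 2 develops the full layer):

```python
import torch
import torch.nn.functional as F

N, F_in, F_out = 4, 3, 8
A_hat = torch.tensor([[1., 1., 0., 1.],     # adjacency with self-loops,
                      [1., 1., 1., 0.],     # so a node also attends to itself
                      [0., 1., 1., 1.],
                      [1., 0., 1., 1.]])
H = torch.randn(N, F_in)

W = torch.randn(F_out, F_in)     # shared projection (learnable in practice)
a = torch.randn(2 * F_out)       # attention vector (learnable in practice)

Wh = H @ W.T                     # (N, F_out): W h_i for every node
# e_ij = LeakyReLU(a^T [W h_i || W h_j]), computed for all pairs at once
left = Wh @ a[:F_out]            # a_1^T W h_i, shape (N,)
right = Wh @ a[F_out:]           # a_2^T W h_j, shape (N,)
e = F.leaky_relu(left.unsqueeze(1) + right.unsqueeze(0), negative_slope=0.2)

# Softmax only over each node's neighborhood (non-edges masked to -inf)
e = e.masked_fill(A_hat == 0, float('-inf'))
alpha = torch.softmax(e, dim=1)  # alpha[i, j]: weight of neighbor j for node i

H_out = F.elu(alpha @ Wh)        # attention-weighted aggregation + nonlinearity
```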
Multi-Head Attention
To stabilize training, GAT runs $K$ independent attention heads in parallel and concatenates their outputs:

$$\mathbf{h}_i' = \big\Vert_{k=1}^{K} \sigma\left(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{(k)}\, \mathbf{W}^{(k)} \mathbf{h}_j\right)$$
where $\|$ denotes concatenation. For the output layer, heads are averaged instead of concatenated to produce $C$-dimensional logits.
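A small sketch of the two head-combination strategies, with hypothetical per-head outputs standing in for real attention results:

```python
import torch

K, N, F_out = 4, 34, 8
# Hypothetical per-head outputs, each of shape (N, F_out), e.g. from K independent
# copies of the single-head sketch above
head_outputs = [torch.randn(N, F_out) for _ in range(K)]

# Hidden layers: concatenate heads along the feature dimension -> (N, K * F_out)
H_concat = torch.cat(head_outputs, dim=1)

# Output layer: average heads instead (with F_out = C) -> (N, C) logits
H_avg = torch.stack(head_outputs, dim=0).mean(dim=0)
print(H_concat.shape, H_avg.shape)  # torch.Size([34, 32]) torch.Size([34, 8])
```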
GCN vs. GAT: When to Use Which
| Property | GCN | GAT |
|---|---|---|
| Neighbor weighting | Fixed (degree-based) | Learned (attention) |
| Parameters per layer | $O(F \cdot F')$ | $O(K(F \cdot F' + 2F'))$ |
| Computational cost | Lower | Higher (attention scores) |
| Best for | Homogeneous graphs | Heterogeneous importance |
Looking Ahead
In Part 2, we implement both layers in pure PyTorch -- no graph libraries. We will see how the normalized adjacency matrix and attention scores translate to ~150 lines of actual code.
In Part 3, we train both models on Zachary's Karate Club (34 nodes, 78 edges) with only 4 labeled nodes and compare convergence, accuracy, and the learned representations.