This second installment translates the GCN and GAT equations from Part 1 into working Python code using only torch.nn and torch.Tensor operations. We walk through adjacency normalization, the GCN forward pass, GAT attention computation, and multi-head concatenation. The complete implementation is ~150 lines with no external graph libraries.
Design Philosophy
Our implementation follows three rules:
- Pure PyTorch -- no `torch_geometric`, no `dgl`, no external graph libraries.
- Readable -- each mathematical operation maps to one or two lines of code.
- Correct -- numerically stable, with proper initialization and masking.
Adjacency Normalization
The most important utility function computes the symmetric normalization from Part 1:
@staticmethod
def compute_norm_adj(adj: torch.Tensor) -> torch.Tensor:
    # A_hat = A + I (self-loops)
    a_hat = adj + torch.eye(adj.size(0), device=adj.device)
    # Degree vector
    d_hat = a_hat.sum(dim=1)  # (N,)
    # D^{-1/2}
    d_inv_sqrt = torch.pow(d_hat, -0.5)
    d_inv_sqrt[torch.isinf(d_inv_sqrt)] = 0.0
    # D^{-1/2} A_hat D^{-1/2}
    norm_adj = d_inv_sqrt.unsqueeze(1) * a_hat * d_inv_sqrt.unsqueeze(0)
    return norm_adj
Key details:
- Self-loops (`+ torch.eye`) ensure every node includes its own features in the aggregation.
- The `isinf` guard handles isolated nodes (degree 0) that would cause division by zero.
- The outer-product trick avoids constructing a full diagonal matrix, saving memory.
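As a sanity check, here is the same normalization applied to a hypothetical 3-node path graph; the standalone `compute_norm_adj` below simply mirrors the static method above so the snippet runs on its own.

```python
import torch

def compute_norm_adj(adj: torch.Tensor) -> torch.Tensor:
    # Standalone copy of the layer's static method
    a_hat = adj + torch.eye(adj.size(0), device=adj.device)
    d_hat = a_hat.sum(dim=1)
    d_inv_sqrt = torch.pow(d_hat, -0.5)
    d_inv_sqrt[torch.isinf(d_inv_sqrt)] = 0.0
    return d_inv_sqrt.unsqueeze(1) * a_hat * d_inv_sqrt.unsqueeze(0)

# Path graph 0 - 1 - 2; with self-loops, the degrees are 2, 3, 2
adj = torch.tensor([[0., 1., 0.],
                    [1., 0., 1.],
                    [0., 1., 0.]])
norm = compute_norm_adj(adj)

# Entry (1, 1) is 1 / d_1 = 1/3; entry (0, 1) is 1 / sqrt(2 * 3)
assert torch.allclose(norm, norm.t())           # symmetric by construction
assert abs(norm[1, 1].item() - 1 / 3) < 1e-6
assert abs(norm[0, 1].item() - 1 / 6 ** 0.5) < 1e-6
```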
GCN Layer Implementation
class GCNLayer(nn.Module):
    def __init__(self, in_features, out_features,
                 activation=True, bias=True):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features, bias=bias)
        self.activation = activation

    def forward(self, x, adj):
        norm_adj = self.compute_norm_adj(adj)  # (N, N)
        support = self.linear(x)               # (N, out)
        out = torch.mm(norm_adj, support)      # (N, out)
        if self.activation:
            out = F.relu(out)
        return out
The order of operations matters:
- Transform first ($\mathbf{XW}$): reduces dimensionality before the expensive matrix multiply with the adjacency.
- Then aggregate ($\hat{\mathbf{A}}_{\text{norm}} \cdot \mathbf{XW}$): propagates the already-projected features.
This is more efficient than aggregating first when $F' < F$, which is the common case in later layers.
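A quick numeric check makes both points concrete: the two orderings produce identical results, but the dense-multiply FLOP counts differ whenever the output width is smaller than the input width. All sizes here are hypothetical.

```python
import torch

torch.manual_seed(0)
N, F_in, F_out = 6, 8, 3             # hypothetical sizes with F_out < F_in
X = torch.randn(N, F_in)
W = torch.randn(F_in, F_out)
A = torch.rand(N, N)                 # stands in for the normalized adjacency

# Transform-then-aggregate vs aggregate-then-transform: same result...
out1 = A @ (X @ W)
out2 = (A @ X) @ W
assert torch.allclose(out1, out2, atol=1e-4)

# ...but different multiply counts for the two dense products:
flops_transform_first = N * F_in * F_out + N * N * F_out   # XW, then A @ (XW)
flops_aggregate_first = N * N * F_in + N * F_in * F_out    # AX, then (AX) @ W
assert flops_transform_first < flops_aggregate_first       # since F_out < F_in
```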
GAT Layer Implementation
The GAT layer is more involved. We break it into four stages.
Stage 1: Linear Projection
# W: (K, F_in, F_out) -- K independent heads
h = torch.einsum("nf,kfo->kno", x, self.W) # (K, N, F_out)
Each of the $K$ attention heads has its own weight matrix. The einsum computes all $K$ projections in one fused operation.
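To confirm the einsum matches $K$ separate matmuls, here is a small equivalence check (head and feature sizes are hypothetical):

```python
import torch

torch.manual_seed(0)
K, N, F_in, F_out = 4, 5, 8, 3                      # hypothetical sizes
x = torch.randn(N, F_in)
W = torch.randn(K, F_in, F_out)

h = torch.einsum("nf,kfo->kno", x, W)               # all K heads at once
h_loop = torch.stack([x @ W[k] for k in range(K)])  # per-head matmul, for comparison
assert torch.allclose(h, h_loop, atol=1e-5)
assert h.shape == (K, N, F_out)
```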
Stage 2: Attention Scores
e_left = torch.matmul(h, self.a_left).squeeze(-1) # (K, N)
e_right = torch.matmul(h, self.a_right).squeeze(-1) # (K, N)
# e_ij = e_left_i + e_right_j via broadcasting
e = e_left.unsqueeze(2) + e_right.unsqueeze(1) # (K, N, N)
e = self.leaky_relu(e)
The original GAT paper concatenates $[\mathbf{W}\mathbf{h}_i \,\|\, \mathbf{W}\mathbf{h}_j]$ and applies the attention vector $\mathbf{a}$. Since $\mathbf{a}$ acts on the two halves independently, we can split it into $\mathbf{a}_{\text{left}}$ and $\mathbf{a}_{\text{right}}$ and use broadcasting instead of explicit concatenation. This avoids creating an $O(N^2 F')$ intermediate tensor.
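The split is easy to verify numerically for a single head: the broadcast sum reproduces the scores computed with explicit concatenation (before the LeakyReLU, which is applied identically in both cases). Sizes below are hypothetical.

```python
import torch

torch.manual_seed(0)
N, F_out = 4, 3
h = torch.randn(N, F_out)            # projected features for one head
a = torch.randn(2 * F_out)           # attention vector over [h_i || h_j]
a_left, a_right = a[:F_out], a[F_out:]

# Explicit concatenation: e[i, j] = a . [h_i || h_j]
e_concat = torch.empty(N, N)
for i in range(N):
    for j in range(N):
        e_concat[i, j] = torch.dot(a, torch.cat([h[i], h[j]]))

# Broadcast version: split a, then add the two outer sums
e_left = h @ a_left                  # (N,)
e_right = h @ a_right                # (N,)
e_bcast = e_left.unsqueeze(1) + e_right.unsqueeze(0)
assert torch.allclose(e_concat, e_bcast, atol=1e-5)
```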
Stage 3: Masked Softmax
mask = (adj_hat == 0).unsqueeze(0) # (1, N, N)
e = e.masked_fill(mask, float("-inf"))
alpha = F.softmax(e, dim=2) # (K, N, N)
alpha = self.dropout(alpha) # attention dropout
Non-edges receive $-\infty$ before softmax, ensuring $\alpha_{ij} = 0$ for disconnected pairs. Dropout on attention coefficients is a regularization technique from the original GAT paper.
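A toy graph shows the masking in action: the coefficient for a missing edge comes out exactly zero, and every row still sums to one.

```python
import torch
import torch.nn.functional as F

# Toy graph (self-loops included) with no edge between nodes 0 and 2
adj_hat = torch.tensor([[1., 1., 0.],
                        [1., 1., 1.],
                        [0., 1., 1.]])
e = torch.randn(1, 3, 3)                       # raw scores for K = 1 head

mask = (adj_hat == 0).unsqueeze(0)
e = e.masked_fill(mask, float("-inf"))
alpha = F.softmax(e, dim=2)

assert alpha[0, 0, 2].item() == 0.0            # non-edge gets zero weight
assert torch.allclose(alpha.sum(dim=2), torch.ones(1, 3), atol=1e-6)
```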
Stage 4: Aggregation and Multi-Head Output
out = torch.bmm(alpha, h)  # (K, N, F_out)
if self.concat:
    # Concatenate K heads -> (N, K * F_out)
    out = out.permute(1, 0, 2).contiguous().view(N, -1)
else:
    # Average K heads -> (N, F_out)
    out = out.mean(dim=0)
Hidden layers concatenate heads (increasing expressivity); the output layer averages them (producing $C$-dimensional logits).
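The shape difference between the two modes can be checked directly (hypothetical sizes):

```python
import torch

K, N, F_out = 4, 5, 3                               # hypothetical sizes
alpha = torch.softmax(torch.randn(K, N, N), dim=2)  # dummy attention weights
h = torch.randn(K, N, F_out)
out = torch.bmm(alpha, h)                           # (K, N, F_out)

concat = out.permute(1, 0, 2).contiguous().view(N, -1)
mean = out.mean(dim=0)
assert concat.shape == (N, K * F_out)   # hidden layers: wider features
assert mean.shape == (N, F_out)         # output layer: C-dimensional logits
```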
Full Model Architectures
GCN Model
class GCN(nn.Module):
    def __init__(self, n_features, n_hidden, n_classes, dropout=0.5):
        super().__init__()
        self.gc1 = GCNLayer(n_features, n_hidden, activation=True)
        self.gc2 = GCNLayer(n_hidden, n_classes, activation=False)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, adj):
        x = self.gc1(x, adj)
        x = self.dropout(x)
        x = self.gc2(x, adj)
        return x  # raw logits
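For a quick end-to-end smoke test, the sketch below restates the two-layer forward pass inline (normalization plus two linear transforms) so it runs without the class definitions; sizes and the random graph are hypothetical, and dropout is skipped as in eval mode.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def norm_adj(adj):
    # Symmetric normalization with self-loops, as in compute_norm_adj
    a_hat = adj + torch.eye(adj.size(0))
    d_inv_sqrt = a_hat.sum(dim=1).pow(-0.5)
    return d_inv_sqrt.unsqueeze(1) * a_hat * d_inv_sqrt.unsqueeze(0)

torch.manual_seed(0)
N, F_in, H, C = 5, 8, 4, 2                 # hypothetical sizes
x = torch.randn(N, F_in)
adj = (torch.rand(N, N) > 0.5).float()
adj = ((adj + adj.t()) > 0).float()        # symmetrize
adj.fill_diagonal_(0)                      # self-loops are added in norm_adj

gc1, gc2 = nn.Linear(F_in, H), nn.Linear(H, C)
A = norm_adj(adj)
h = F.relu(A @ gc1(x))                     # layer 1 (dropout skipped: eval mode)
logits = A @ gc2(h)                        # layer 2, raw logits
assert logits.shape == (N, C)
```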
GAT Model
class GAT(nn.Module):
    def __init__(self, n_features, n_hidden, n_classes,
                 n_heads=4, dropout=0.6):
        super().__init__()
        self.attn1 = GATLayer(n_features, n_hidden,
                              n_heads=n_heads, concat=True,
                              dropout=dropout)
        self.attn2 = GATLayer(n_hidden * n_heads, n_classes,
                              n_heads=1, concat=False,
                              dropout=dropout)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, adj):
        x = self.dropout(x)
        x = self.attn1(x, adj)
        x = F.elu(x)
        x = self.dropout(x)
        x = self.attn2(x, adj)
        return x
Following the original paper, ELU (not ReLU) is used as the activation between GAT layers, and dropout is applied to the input features as well, providing additional regularization.
Parameter Initialization
Both layers use Xavier uniform initialization:
nn.init.xavier_uniform_(self.linear.weight)
nn.init.zeros_(self.linear.bias)
For GAT, all three parameter tensors ($\mathbf{W}$, $\mathbf{a}_\text{left}$, $\mathbf{a}_\text{right}$) are Xavier-initialized. This ensures that the variance of activations is preserved through layers at initialization, which is critical for stable training with LeakyReLU attention.
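The Xavier-uniform bound can be verified directly: PyTorch draws weights from $[-a, a]$ with $a = \text{gain} \cdot \sqrt{6 / (\text{fan\_in} + \text{fan\_out})}$. The layer sizes below are hypothetical.

```python
import math
import torch
import torch.nn as nn

lin = nn.Linear(16, 8)                 # hypothetical layer sizes
nn.init.xavier_uniform_(lin.weight)
nn.init.zeros_(lin.bias)

# Default gain = 1, so a = sqrt(6 / (fan_in + fan_out)) = sqrt(6 / 24) = 0.5
fan_in, fan_out = 16, 8
a = math.sqrt(6.0 / (fan_in + fan_out))
assert lin.weight.abs().max().item() <= a + 1e-6
assert lin.bias.abs().sum().item() == 0.0
```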
Summary
| Component | GCN | GAT |
|---|---|---|
| Lines of code (layer) | 35 | 55 |
| Lines of code (model) | 15 | 20 |
| Parameters (our config) | 610 | 1,378 |
| External dependencies | None | None |
In Part 3, we put these models to the test on Zachary's Karate Club -- a real social network where the graph structure alone must separate two factions using only 4 labeled nodes.