
Deconstructing GNNs from Scratch

Part 2: PyTorch Implementation

Abstract

This second installment translates the GCN and GAT equations from Part 1 into working Python code using only torch.nn and torch.Tensor operations. We walk through adjacency normalization, the GCN forward pass, GAT attention computation, and multi-head concatenation. The complete implementation is ~150 lines with no external graph libraries.

Design Philosophy

Our implementation follows three rules:

  1. Pure PyTorch -- no torch_geometric, no dgl, no external graph libraries.
  2. Readable -- each mathematical operation maps to one or two lines of code.
  3. Correct -- numerically stable, with proper initialization and masking.

Adjacency Normalization

The most important utility function computes the symmetric normalization from Part 1:

@staticmethod
def compute_norm_adj(adj: torch.Tensor) -> torch.Tensor:
    # A_hat = A + I  (self-loops)
    a_hat = adj + torch.eye(adj.size(0), device=adj.device)
    # Degree vector
    d_hat = a_hat.sum(dim=1)                        # (N,)
    # D^{-1/2}
    d_inv_sqrt = torch.pow(d_hat, -0.5)
    d_inv_sqrt[torch.isinf(d_inv_sqrt)] = 0.0
    # D^{-1/2} A_hat D^{-1/2}
    norm_adj = d_inv_sqrt.unsqueeze(1) * a_hat * d_inv_sqrt.unsqueeze(0)
    return norm_adj

Key details:

  1. Self-loops -- adding the identity means each node's own features participate in aggregation, not just its neighbors'.
  2. Isolated nodes -- torch.pow(d_hat, -0.5) returns inf for a zero degree, so the isinf guard zeroes those entries. After self-loops every degree is at least 1, so this check is purely defensive.
  3. Broadcasting -- scaling row $i$ by $\hat{d}_i^{-1/2}$ and column $j$ by $\hat{d}_j^{-1/2}$ via unsqueeze implements $\hat{\mathbf{D}}^{-1/2} \hat{\mathbf{A}} \hat{\mathbf{D}}^{-1/2}$ without materializing any diagonal matrix.
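As a sanity check, the normalization can be exercised on a tiny path graph. The snippet below restates the same logic as a free function so it runs standalone:

```python
import torch

def compute_norm_adj(adj: torch.Tensor) -> torch.Tensor:
    # Same logic as the method above, as a free function for testing
    a_hat = adj + torch.eye(adj.size(0), device=adj.device)
    d_hat = a_hat.sum(dim=1)
    d_inv_sqrt = torch.pow(d_hat, -0.5)
    d_inv_sqrt[torch.isinf(d_inv_sqrt)] = 0.0
    return d_inv_sqrt.unsqueeze(1) * a_hat * d_inv_sqrt.unsqueeze(0)

# Path graph 0 -- 1 -- 2
adj = torch.tensor([[0., 1., 0.],
                    [1., 0., 1.],
                    [0., 1., 0.]])
norm_adj = compute_norm_adj(adj)

# Symmetric normalization stays symmetric for undirected graphs
assert torch.allclose(norm_adj, norm_adj.T)
# Node 0 has degree 2 after the self-loop, node 1 has degree 3:
# entry (0, 0) is 1/2 and entry (0, 1) is 1/sqrt(2 * 3)
assert torch.isclose(norm_adj[0, 0], torch.tensor(0.5))
assert torch.isclose(norm_adj[0, 1], torch.tensor(1.0 / 6 ** 0.5))
```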

GCN Layer Implementation

import torch
import torch.nn as nn
import torch.nn.functional as F

class GCNLayer(nn.Module):
    def __init__(self, in_features, out_features,
                 activation=True, bias=True):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features, bias=bias)
        self.activation = activation

    def forward(self, x, adj):
        norm_adj = self.compute_norm_adj(adj)   # (N, N)
        support = self.linear(x)                # (N, out)
        out = torch.mm(norm_adj, support)       # (N, out)
        if self.activation:
            out = F.relu(out)
        return out

The order of operations matters:

  1. Transform first ($\mathbf{XW}$): reduces dimensionality before the expensive matrix multiply with the adjacency.
  2. Then aggregate ($\hat{\mathbf{A}}_{\text{norm}} \cdot \mathbf{XW}$): propagates the already-projected features.

This is more efficient than aggregating first whenever $F' < F$, which holds in most layers and most dramatically in the first, where the raw input features are high-dimensional.
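Because matrix multiplication is associative, both orders produce identical outputs; only the cost differs. A standalone check with random stand-in tensors:

```python
import torch

torch.manual_seed(0)
N, F_in, F_out = 6, 8, 3
norm_adj = torch.rand(N, N)            # stands in for D^{-1/2} A_hat D^{-1/2}
x = torch.randn(N, F_in)
W = torch.randn(F_in, F_out)

transform_first = norm_adj @ (x @ W)   # project, then aggregate
aggregate_first = (norm_adj @ x) @ W   # aggregate, then project

# Associativity: same result either way
assert torch.allclose(transform_first, aggregate_first, atol=1e-5)
```

The adjacency multiply costs $O(N^2 F')$ in the first order versus $O(N^2 F)$ in the second, which is why transforming first wins when $F' < F$.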

GAT Layer Implementation

The GAT layer is more involved. We break it into four stages.

Stage 1: Linear Projection

# W: (K, F_in, F_out)  -- K independent heads
h = torch.einsum("nf,kfo->kno", x, self.W)   # (K, N, F_out)

Each of the $K$ attention heads has its own weight matrix. The einsum computes all $K$ projections in one fused operation.
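The einsum is equivalent to looping over heads with an ordinary matmul, which can be verified on made-up sizes ($K = 2$ heads):

```python
import torch

torch.manual_seed(0)
K, N, F_in, F_out = 2, 5, 4, 3
x = torch.randn(N, F_in)
W = torch.randn(K, F_in, F_out)

h_einsum = torch.einsum("nf,kfo->kno", x, W)         # all heads at once
h_loop = torch.stack([x @ W[k] for k in range(K)])   # one head at a time

assert h_einsum.shape == (K, N, F_out)
assert torch.allclose(h_einsum, h_loop, atol=1e-5)
```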

Stage 2: Attention Scores

e_left  = torch.matmul(h, self.a_left).squeeze(-1)    # (K, N)
e_right = torch.matmul(h, self.a_right).squeeze(-1)   # (K, N)

# e_ij = e_left_i + e_right_j  via broadcasting
e = e_left.unsqueeze(2) + e_right.unsqueeze(1)         # (K, N, N)
e = self.leaky_relu(e)

The original GAT paper concatenates $[\mathbf{W}\mathbf{h}_i \,\|\, \mathbf{W}\mathbf{h}_j]$ and applies the attention vector $\mathbf{a}$. Since $\mathbf{a}$ acts on the two halves independently, we can split it into $\mathbf{a}_{\text{left}}$ and $\mathbf{a}_{\text{right}}$ and use broadcasting instead of explicit concatenation. This avoids creating an $O(N^2 F')$ intermediate tensor.
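The equivalence can be checked directly: explicit concatenation and the split-and-broadcast trick give identical scores (single head for brevity):

```python
import torch

torch.manual_seed(0)
N, F_out = 4, 3
h = torch.randn(N, F_out)
a = torch.randn(2 * F_out)                 # full attention vector
a_left, a_right = a[:F_out], a[F_out:]

# Explicit concatenation: e[i, j] = a^T [h_i || h_j]  -- O(N^2 F') memory
h_i = h.unsqueeze(1).expand(N, N, F_out)   # h_i[i, j] = h[i]
h_j = h.unsqueeze(0).expand(N, N, F_out)   # h_j[i, j] = h[j]
e_concat = torch.cat([h_i, h_j], dim=-1) @ a

# Broadcast version: e[i, j] = a_left^T h_i + a_right^T h_j
e_broadcast = (h @ a_left).unsqueeze(1) + (h @ a_right).unsqueeze(0)

assert torch.allclose(e_concat, e_broadcast, atol=1e-5)
```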

Stage 3: Masked Softmax

# adj_hat = adj + I: the adjacency with self-loops, as in the GCN utility
mask = (adj_hat == 0).unsqueeze(0)             # (1, N, N)
e = e.masked_fill(mask, float("-inf"))
alpha = F.softmax(e, dim=2)                    # (K, N, N)
alpha = self.dropout(alpha)                    # attention dropout

Non-edges receive $-\infty$ before softmax, ensuring $\alpha_{ij} = 0$ for disconnected pairs. Dropout on attention coefficients is a regularization technique from the original GAT paper.
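A small standalone demonstration (single head) that the mask zeroes attention on non-edges while each row remains a valid distribution:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Adjacency with self-loops for the path graph 0 -- 1 -- 2
adj_hat = torch.tensor([[1., 1., 0.],
                        [1., 1., 1.],
                        [0., 1., 1.]])
e = torch.randn(3, 3)                          # raw attention scores

e = e.masked_fill(adj_hat == 0, float("-inf"))
alpha = F.softmax(e, dim=1)

assert alpha[0, 2].item() == 0.0               # non-edge gets zero weight
assert torch.allclose(alpha.sum(dim=1), torch.ones(3))  # rows sum to 1
```

Self-loops guarantee every row has at least one finite score, so no row of the softmax degenerates to NaN.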

Stage 4: Aggregation and Multi-Head Output

out = torch.bmm(alpha, h)                     # (K, N, F_out)

N = out.size(1)
if self.concat:
    # Concatenate K heads -> (N, K * F_out)
    out = out.permute(1, 0, 2).contiguous().view(N, -1)
else:
    # Average K heads -> (N, F_out)
    out = out.mean(dim=0)

Hidden layers concatenate heads (increasing expressivity); the output layer averages them (producing $C$-dimensional logits).
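The shape bookkeeping for the two modes can be verified on made-up sizes ($K = 4$ heads, $F' = 8$):

```python
import torch

K, N, F_out = 4, 7, 8
out = torch.randn(K, N, F_out)       # per-head outputs

# Hidden layers: interleave heads per node, then flatten
concat = out.permute(1, 0, 2).contiguous().view(N, -1)
# Output layer: average over the head dimension
mean = out.mean(dim=0)

assert concat.shape == (N, K * F_out)
assert mean.shape == (N, F_out)
```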

Full Model Architectures

GCN Model

class GCN(nn.Module):
    def __init__(self, n_features, n_hidden, n_classes, dropout=0.5):
        super().__init__()
        self.gc1 = GCNLayer(n_features, n_hidden, activation=True)
        self.gc2 = GCNLayer(n_hidden, n_classes, activation=False)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, adj):
        x = self.gc1(x, adj)
        x = self.dropout(x)
        x = self.gc2(x, adj)
        return x  # raw logits

GAT Model

class GAT(nn.Module):
    def __init__(self, n_features, n_hidden, n_classes,
                 n_heads=4, dropout=0.6):
        super().__init__()
        self.attn1 = GATLayer(n_features, n_hidden,
                              n_heads=n_heads, concat=True,
                              dropout=dropout)
        self.attn2 = GATLayer(n_hidden * n_heads, n_classes,
                              n_heads=1, concat=False,
                              dropout=dropout)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, adj):
        x = self.dropout(x)
        x = self.attn1(x, adj)
        x = F.elu(x)
        x = self.dropout(x)
        x = self.attn2(x, adj)
        return x

ELU activation (not ReLU) is used between GAT layers, following the original paper. Dropout is applied not only between layers but also to the raw input features, providing additional regularization on small datasets.
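A minimal smoke test wires the GCN pieces together end to end; the layer and model are restated (condensed from the snippets above) so this runs standalone, with made-up graph sizes:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCNLayer(nn.Module):
    def __init__(self, in_features, out_features, activation=True, bias=True):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features, bias=bias)
        self.activation = activation

    @staticmethod
    def compute_norm_adj(adj):
        a_hat = adj + torch.eye(adj.size(0), device=adj.device)
        d_inv_sqrt = torch.pow(a_hat.sum(dim=1), -0.5)
        d_inv_sqrt[torch.isinf(d_inv_sqrt)] = 0.0
        return d_inv_sqrt.unsqueeze(1) * a_hat * d_inv_sqrt.unsqueeze(0)

    def forward(self, x, adj):
        out = torch.mm(self.compute_norm_adj(adj), self.linear(x))
        return F.relu(out) if self.activation else out

class GCN(nn.Module):
    def __init__(self, n_features, n_hidden, n_classes, dropout=0.5):
        super().__init__()
        self.gc1 = GCNLayer(n_features, n_hidden, activation=True)
        self.gc2 = GCNLayer(n_hidden, n_classes, activation=False)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, adj):
        x = self.gc1(x, adj)
        x = self.dropout(x)
        return self.gc2(x, adj)

torch.manual_seed(0)
N, F_in, H, C = 10, 5, 16, 3              # arbitrary sizes for the smoke test
adj = (torch.rand(N, N) > 0.5).float()
adj = ((adj + adj.T) > 0).float()          # symmetrize: undirected graph
x = torch.randn(N, F_in)

model = GCN(F_in, H, C)
model.eval()                               # disable dropout for determinism
logits = model(x, adj)
assert logits.shape == (N, C)              # one row of raw logits per node
```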

Parameter Initialization

Both layers use Xavier uniform initialization:

nn.init.xavier_uniform_(self.linear.weight)
if self.linear.bias is not None:
    nn.init.zeros_(self.linear.bias)

For GAT, all three parameter tensors ($\mathbf{W}$, $\mathbf{a}_\text{left}$, $\mathbf{a}_\text{right}$) are Xavier-initialized. This ensures that the variance of activations is preserved through layers at initialization, which is critical for stable training with LeakyReLU attention.
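A standalone sketch of the GAT initialization, with shapes matching the earlier snippets (the parameter names mirror those snippets; initializing per head slice is one reasonable choice, since PyTorch's fan computation on 3-D tensors treats trailing dimensions as a receptive field):

```python
import torch
import torch.nn as nn

K, F_in, F_out = 4, 8, 8
W = nn.Parameter(torch.empty(K, F_in, F_out))
a_left = nn.Parameter(torch.empty(K, F_out, 1))
a_right = nn.Parameter(torch.empty(K, F_out, 1))

for k in range(K):
    # Each head's (F_in, F_out) matrix gets its own Xavier draw
    nn.init.xavier_uniform_(W[k])
    nn.init.xavier_uniform_(a_left[k])
    nn.init.xavier_uniform_(a_right[k])

# Xavier uniform samples from U(-b, b) with b = sqrt(6 / (fan_in + fan_out))
bound = (6.0 / (F_in + F_out)) ** 0.5
assert float(W.abs().max()) <= bound + 1e-6
```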

Summary

Component                  GCN     GAT
Lines of code (layer)      35      55
Lines of code (model)      15      20
Parameters (our config)    610     1,378
External dependencies      None    None

In Part 3, we put these models to the test on Zachary's Karate Club -- a real social network where the graph structure alone must separate two factions using only 4 labeled nodes.