Back to KANs Hub

Deconstructing Kolmogorov-Arnold Networks (KANs)

Part 1: The Mathematics of Splines on Edges

Introduction

The MLP is the first thing you learn in any ML course: multiply inputs by a weight matrix, sum at each node, apply a fixed activation like ReLU or Tanh. Linear operations live on the edges (weights), non-linear operations live on the nodes (activations).

What if we flipped that? Nodes just sum. The learnable, non-linear activation functions move to the edges. That is the core idea behind Kolmogorov-Arnold Networks (KANs).

This post covers the math that makes KANs work: the Kolmogorov-Arnold Representation Theorem and B-spline parameterization.

1. The Kolmogorov-Arnold Representation Theorem

KANs rest on a theorem established by Kolmogorov in 1957, with closely related results by his student Arnold: any multivariate continuous function can be decomposed into a finite composition of continuous single-variable functions and addition.

For a continuous function $f : [0,1]^n \to \mathbb{R}$:

$$ f(x_1, \dots, x_n) = \sum_{q=0}^{2n} \Phi_q \left( \sum_{p=1}^{n} \phi_{q,p}(x_p) \right) $$

Where:

- $\Phi_q : \mathbb{R} \to \mathbb{R}$ are the outer univariate functions,
- $\phi_{q,p} : [0,1] \to \mathbb{R}$ are the inner univariate functions, and
- the outer sum has $2n + 1$ terms, so the decomposition is always finite.

Implications for Neural Architecture

MLPs approximate complex functions through depth and width -- lots of nodes with fixed non-linearities. The Kolmogorov-Arnold theorem says you don't need multivariate mappings at all. Univariate functions plus addition are enough.
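A classic toy illustration (not the theorem's actual construction) is multiplication: a genuinely multivariate function that collapses into univariate functions plus addition via logarithms.

```python
import math

# f(x, y) = x * y  rewritten as  exp(ln(x) + ln(y))  for x, y > 0.
# Here exp plays the role of an outer function Phi and ln the inner phi's:
# a multivariate map built from univariate functions and a single sum.

def f(x, y):
    return math.exp(math.log(x) + math.log(y))

print(f(3.0, 4.0))  # 12.0 (up to floating-point error)
```

The theorem guarantees such a decomposition exists for *any* continuous $f$ on $[0,1]^n$, though the inner functions it constructs can be far less smooth than `ln`.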

In a KAN layer, instead of a weight matrix $W \in \mathbb{R}^{n_{out} \times n_{in}}$, you have an $n_{out} \times n_{in}$ array of learnable 1D functions $\phi_{i,j}$ connecting input $i$ to output $j$.

The output of a KAN node is the sum of non-linear edge functions applied to each input:

$$ y_j = \sum_{i=1}^{n_{in}} \phi_{i,j}(x_i) $$

The network learns the functions themselves, not just scalar weights.
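A minimal sketch of one layer's forward pass makes the structure concrete. The names (`kan_layer`, the `phi` grid of callables) are illustrative, not the paper's implementation; here the edge functions are fixed for demonstration, whereas a real KAN would learn them.

```python
import numpy as np

def kan_layer(x, phi):
    """x: (n_in,) input vector; phi: n_in x n_out grid of 1D callables.
    Node j just sums the non-linear edge outputs phi[i][j](x[i])."""
    n_in, n_out = len(phi), len(phi[0])
    y = np.zeros(n_out)
    for j in range(n_out):
        for i in range(n_in):
            y[j] += phi[i][j](x[i])  # non-linearity lives on the edge
    return y

# Example: 2 inputs -> 1 output, with hand-picked edge functions.
phi = [[np.sin], [np.square]]  # phi[0][0] = sin, phi[1][0] = x^2
y = kan_layer(np.array([np.pi / 2, 3.0]), phi)
# y[0] = sin(pi/2) + 3^2 = 10.0
```

Contrast with an MLP layer, which would compute `activation(W @ x)`: the sum is the same, but the learnable non-linearity has moved from the node onto each edge.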

2. Parameterizing Edge Functions: B-Splines

We need a differentiable way to parameterize the 1D edge functions $\phi(x)$. You can't learn arbitrary continuous functions without choosing a basis.

The KAN paper uses B-splines: piecewise polynomial curves defined by control points and a knot vector. The key property is locality -- adjusting one spline parameter only changes a local region of the function, which keeps optimization stable and limits catastrophic forgetting.

Each edge function $\phi(x)$ decomposes into a residual base activation (SiLU) and a learned spline:

$$ \phi(x) = w_b \cdot \text{SiLU}(x) + w_s \cdot \text{Spline}(x) $$

The spline is a linear combination of B-spline basis functions $B_i(x)$:

$$ \text{Spline}(x) = \sum_{i=1}^{N} c_i B_i(x) $$

Here, $c_i$ are the learnable coefficients (the "weights" of the network), $B_i(x)$ are the fixed piecewise-polynomial basis functions evaluated at $x$, and $N$ is the number of basis functions, determined by the grid size and spline order.
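A minimal sketch of this parameterization, using the standard Cox-de Boor recursion to evaluate the basis (the function names and default hyperparameters here are assumptions for illustration, not the paper's code):

```python
import numpy as np

def bspline_basis(x, knots, i, k):
    """Cox-de Boor recursion: i-th B-spline basis of degree k at x."""
    if k == 0:
        return 1.0 if knots[i] <= x < knots[i + 1] else 0.0
    left = 0.0
    if knots[i + k] != knots[i]:
        left = (x - knots[i]) / (knots[i + k] - knots[i]) \
               * bspline_basis(x, knots, i, k - 1)
    right = 0.0
    if knots[i + k + 1] != knots[i + 1]:
        right = (knots[i + k + 1] - x) / (knots[i + k + 1] - knots[i + 1]) \
                * bspline_basis(x, knots, i + 1, k - 1)
    return left + right

def silu(x):
    return x / (1.0 + np.exp(-x))

def phi(x, coeffs, knots, k=3, w_b=1.0, w_s=1.0):
    """Edge function: phi(x) = w_b * SiLU(x) + w_s * sum_i c_i B_i(x)."""
    spline = sum(c * bspline_basis(x, knots, i, k)
                 for i, c in enumerate(coeffs))
    return w_b * silu(x) + w_s * spline

# 5 cubic basis functions need len(coeffs) + k + 1 = 9 knots.
knots = np.linspace(0.0, 1.0, 9)
print(phi(0.5, coeffs=[0.0, 0.0, 1.0, 0.0, 0.0], knots=knots))
```

The locality claim is directly visible here: each $B_i$ is non-zero only between `knots[i]` and `knots[i + k + 1]`, so nudging one coefficient $c_i$ leaves the function untouched everywhere outside that window.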

3. Why Does This Matter?

Moving non-linearity to the edges and parameterizing it with splines gives KANs concrete advantages over MLPs: