The MLP is the first thing you learn in any ML course: multiply inputs by a weight matrix, sum at each node, apply a fixed activation like ReLU or Tanh. Linear operations live on the edges (weights), non-linear operations live on the nodes (activations).
What if we flipped that? Nodes just sum. The learnable, non-linear activation functions move to the edges. That is the core idea behind Kolmogorov-Arnold Networks (KANs).
This post covers the math that makes KANs work: the Kolmogorov-Arnold Representation Theorem and B-spline parameterization.
1. The Kolmogorov-Arnold Representation Theorem
KANs rest on a theorem proved by Kolmogorov in 1957 (and refined by Arnold): any multivariate continuous function can be written as a finite composition of continuous single-variable functions and addition.
For a continuous function $f : [0,1]^n \to \mathbb{R}$:

$$f(x_1, \dots, x_n) = \sum_{q=0}^{2n} \Phi_q\left( \sum_{p=1}^{n} \phi_{q,p}(x_p) \right)$$
Where:
- $\phi_{q,p} : [0,1] \to \mathbb{R}$ are inner functions (mapping 1D to 1D).
- $\Phi_q : \mathbb{R} \to \mathbb{R}$ are outer functions (also mapping 1D to 1D).
- We sum over the inputs $p=1$ to $n$, and then sum the outer compositions $q=0$ to $2n$.
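To build intuition for why univariate functions plus addition can express a multivariate function, here is a toy hand-picked decomposition (not the theorem's general construction, which needs all $2n+1$ outer terms): the product $x_1 x_2$ on positive inputs becomes an exp of a sum of logs.

```python
import math

# Toy illustration of the Kolmogorov-Arnold idea: a multivariate
# function built only from univariate functions and addition.
# Example: f(x1, x2) = x1 * x2 on positive inputs can be written as
#   f = Phi(phi(x1) + phi(x2))  with  phi = log,  Phi = exp.

def phi(x):   # inner univariate function
    return math.log(x)

def Phi(s):   # outer univariate function
    return math.exp(s)

def f_composed(x1, x2):
    return Phi(phi(x1) + phi(x2))

print(f_composed(3.0, 4.0))  # ≈ 12.0, matches x1 * x2
```

This special case works with a single outer term; the theorem guarantees a decomposition like this exists for *any* continuous $f$, at the cost of far less friendly inner functions.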
Implications for Neural Architecture
MLPs approximate complex functions through depth and width -- lots of nodes with fixed non-linearities. The Kolmogorov-Arnold theorem says you don't need multivariate mappings at all. Univariate functions plus addition are enough.
In a KAN layer, instead of a weight matrix $W \in \mathbb{R}^{n_{\text{out}} \times n_{\text{in}}}$, you have a grid of 1D functions $\phi_{i,j}$ connecting input $i$ to output $j$.
The output of a KAN node is the sum of non-linear edge functions applied to each input:

$$x_j = \sum_{i} \phi_{i,j}(x_i)$$
The network learns the functions themselves, not just scalar weights.
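To make the layer concrete, here is a minimal NumPy sketch of a KAN layer forward pass (the function names and the hand-picked edge functions are illustrative, not from any KAN library):

```python
import numpy as np

# Minimal sketch of a KAN layer forward pass. Each edge (i, j) carries
# its own univariate function phi[i][j]; each output node just sums
# its incoming edges -- no weights, no fixed node activation.

def kan_layer_forward(x, edge_fns):
    """x: input vector of length n_in.
    edge_fns: n_in x n_out grid of univariate callables."""
    n_in = len(edge_fns)
    n_out = len(edge_fns[0])
    out = np.zeros(n_out)
    for j in range(n_out):                 # each output node...
        out[j] = sum(edge_fns[i][j](x[i])  # ...sums one edge function
                     for i in range(n_in)) # applied to each input
    return out

# Example: 2 inputs -> 1 output, with hand-picked edge functions.
edge_fns = [[np.sin], [np.square]]  # phi_{1,1} = sin, phi_{2,1} = x^2
print(kan_layer_forward(np.array([np.pi / 2, 3.0]), edge_fns))
# -> [10.] since sin(pi/2) + 3^2 = 10
```

In a trained KAN, the edge functions would be learnable splines rather than fixed callables; the summation at the node is all the "neuron" does.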
2. Parameterizing Edge Functions: B-Splines
We need a differentiable way to parameterize the 1D edge functions $\phi(x)$. You can't learn arbitrary continuous functions without choosing a basis.
The KAN paper uses B-splines: piecewise polynomial curves defined by control points and a knot vector. The key property is locality -- adjusting one spline parameter only changes a local region of the function, which keeps optimization stable and limits catastrophic forgetting.
Each edge function $\phi(x)$ decomposes into a residual base activation (SiLU) and a learned spline:

$$\phi(x) = w_b \, \mathrm{silu}(x) + w_s \, \mathrm{spline}(x)$$
The spline is a linear combination of B-spline basis functions $B_i(x)$:

$$\mathrm{spline}(x) = \sum_i c_i B_i(x)$$
Here, $c_i$ are the learnable coefficients (the "weights" of the network), and $B_i(x)$ are the fixed polynomial basis functions evaluated at $x$.
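A minimal sketch of how such an edge function could be evaluated, using the Cox-de Boor recursion for the basis (helper names are mine; real implementations vectorize this and make $w_b$, $w_s$, and the $c_i$ trainable):

```python
import numpy as np

# Sketch of a single KAN edge function:
#   phi(x) = w_b * silu(x) + w_s * sum_i c_i * B_i(x)

def bspline_basis(i, k, t, x):
    """Value at x of the i-th B-spline basis function of degree k
    over knot vector t (Cox-de Boor recursion)."""
    if k == 0:
        return 1.0 if t[i] <= x < t[i + 1] else 0.0
    left = right = 0.0
    if t[i + k] > t[i]:
        left = (x - t[i]) / (t[i + k] - t[i]) * bspline_basis(i, k - 1, t, x)
    if t[i + k + 1] > t[i + 1]:
        right = ((t[i + k + 1] - x) / (t[i + k + 1] - t[i + 1])
                 * bspline_basis(i + 1, k - 1, t, x))
    return left + right

def silu(x):
    return x / (1.0 + np.exp(-x))

def edge_phi(x, w_b, w_s, coeffs, t, k):
    """phi(x) = w_b * silu(x) + w_s * spline(x)."""
    spline = sum(c * bspline_basis(i, k, t, x) for i, c in enumerate(coeffs))
    return w_b * silu(x) + w_s * spline

# Degree-2 basis on a uniform knot vector; 8 knots -> 8 - 2 - 1 = 5 basis fns.
t = np.linspace(0, 1, 8)
coeffs = [1.0] * 5
# With all coefficients equal to 1, the basis sums to 1 in the interior
# (partition of unity), so the pure spline part evaluates to ~1 there:
print(edge_phi(0.5, w_b=0.0, w_s=1.0, coeffs=coeffs, t=t, k=2))
```

The locality claimed above is visible in `bspline_basis`: each $B_i$ is non-zero only on the knot span $[t_i, t_{i+k+1}]$, so nudging one $c_i$ leaves the function untouched everywhere else.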
3. Why does this matter?
Moving non-linearity to the edges and parameterizing it with splines gives KANs concrete advantages over MLPs:
- Interpretability: Edge functions are 1D splines, so you can plot them directly. If the network learns a $\sin(x)$ or $x^2$ mapping, you see the curve on the edge. Good luck extracting that from a 10,000x10,000 weight matrix.
- Parameter Efficiency in Symbolic Tasks: On physics and math problems, KANs often match or beat MLP accuracy with orders of magnitude fewer parameters, because they learn the symbolic function shape directly.
- Grid Extension: You can increase B-spline grid resolution after training without retraining from scratch.
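The last point can be sketched as a least-squares refit: evaluate the trained coarse spline at sample points, then solve for coefficients on a finer knot grid that reproduce it. A simplified sketch using SciPy (names and details are mine, not the KAN paper's code):

```python
import numpy as np
from scipy.interpolate import BSpline

# Grid extension sketch: refit a trained coarse spline on a finer,
# nested knot grid by least squares, so optimization can continue at
# higher resolution without restarting from scratch.

k = 3                                          # cubic splines
rng = np.random.default_rng(0)

def clamped_knots(interior, k):
    """Repeat the end knots k extra times so the spline covers [0, 1]."""
    return np.concatenate([[interior[0]] * k, interior, [interior[-1]] * k])

tc = clamped_knots(np.linspace(0, 1, 6), k)    # coarse grid
cc = rng.normal(size=len(tc) - k - 1)          # "trained" coarse coefficients
coarse = BSpline(tc, cc, k)

tf = clamped_knots(np.linspace(0, 1, 11), k)   # finer grid (nested knots)
n_fine = len(tf) - k - 1

# Design matrix: each column is one fine basis function sampled on (0, 1).
xs = np.linspace(0.001, 0.999, 200)
B = np.column_stack([
    BSpline(tf, np.eye(n_fine)[i], k)(xs) for i in range(n_fine)
])
cf, *_ = np.linalg.lstsq(B, coarse(xs), rcond=None)

fine = BSpline(tf, cf, k)
print(np.max(np.abs(fine(xs) - coarse(xs))))   # ~0: same function, finer grid
```

Because the fine knot set contains the coarse one, the coarse spline lies exactly in the fine spline space, so the refit is essentially lossless; training then resumes with more coefficients and finer control.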