The MLP is the first thing you learn in any ML course: multiply inputs by a weight matrix, sum at each node, apply a fixed activation like ReLU or Tanh. Linear operations live on the edges (weights), non-linear operations live on the nodes (activations).
What if we flipped that? Nodes just sum. The learnable, non-linear activation functions move to the edges. That is the core idea behind Kolmogorov-Arnold Networks (KANs).
This post covers the math that makes KANs work: the Kolmogorov-Arnold Representation Theorem and B-spline parameterization.
1. The Kolmogorov-Arnold Representation Theorem
KANs rest on a theorem proved by Kolmogorov in 1957 (and refined by Arnold): any multivariate continuous function can be written as a finite composition of continuous single-variable functions and addition.
For a continuous function $f : [0,1]^n \to \mathbb{R}$:

$$f(x_1, \dots, x_n) = \sum_{q=0}^{2n} \Phi_q\left( \sum_{p=1}^{n} \phi_{q,p}(x_p) \right)$$
Where:
- $\phi_{q,p} : [0,1] \to \mathbb{R}$ are inner functions (mapping 1D to 1D).
- $\Phi_q : \mathbb{R} \to \mathbb{R}$ are outer functions (also mapping 1D to 1D).
- We sum over the inputs $p=1$ to $n$, and then sum the outer compositions $q=0$ to $2n$.
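To build intuition for why univariate functions plus addition can express a multivariate function, here is a toy hand-picked decomposition (not the theorem's general construction, which needs all $2n+1$ outer terms): the product $x_1 x_2$ on positive inputs becomes an exp of a sum of logs.

```python
import math

# Toy illustration of the Kolmogorov-Arnold idea: a multivariate
# function built only from univariate functions and addition.
# Example: f(x1, x2) = x1 * x2 on positive inputs can be written as
#   f = Phi(phi(x1) + phi(x2))  with  phi = log,  Phi = exp.

def phi(x):   # inner univariate function
    return math.log(x)

def Phi(s):   # outer univariate function
    return math.exp(s)

def f_composed(x1, x2):
    return Phi(phi(x1) + phi(x2))

print(f_composed(3.0, 4.0))  # ≈ 12.0, matches x1 * x2
```

This special case works with a single outer term; the theorem guarantees a decomposition like this exists for *any* continuous $f$, at the cost of far less friendly inner functions.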
Implications for Neural Architecture
MLPs approximate complex functions through depth and width -- lots of nodes with fixed non-linearities. The Kolmogorov-Arnold theorem says you don't need multivariate mappings at all. Univariate functions plus addition are enough.
In a KAN layer, instead of a weight matrix $W \in \mathbb{R}^{n_{\text{out}} \times n_{\text{in}}}$, you have a grid of 1D functions $\phi_{i,j}$ connecting input $i$ to output $j$.
The output of a KAN node is the sum of non-linear edge functions applied to each input:

$$x_j = \sum_{i} \phi_{i,j}(x_i)$$
The network learns the functions themselves, not just scalar weights.
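To make the layer concrete, here is a minimal NumPy sketch of a KAN layer forward pass (the function names and the hand-picked edge functions are illustrative, not from any KAN library):

```python
import numpy as np

# Minimal sketch of a KAN layer forward pass. Each edge (i, j) carries
# its own univariate function phi[i][j]; each output node just sums
# its incoming edges -- no weights, no fixed node activation.

def kan_layer_forward(x, edge_fns):
    """x: input vector of length n_in.
    edge_fns: n_in x n_out grid of univariate callables."""
    n_in = len(edge_fns)
    n_out = len(edge_fns[0])
    out = np.zeros(n_out)
    for j in range(n_out):                 # each output node...
        out[j] = sum(edge_fns[i][j](x[i])  # ...sums one edge function
                     for i in range(n_in)) # applied to each input
    return out

# Example: 2 inputs -> 1 output, with hand-picked edge functions.
edge_fns = [[np.sin], [np.square]]  # phi_{1,1} = sin, phi_{2,1} = x^2
print(kan_layer_forward(np.array([np.pi / 2, 3.0]), edge_fns))
# -> [10.] since sin(pi/2) + 3^2 = 10
```

In a trained KAN, the edge functions would be learnable splines rather than fixed callables; the summation at the node is all the "neuron" does.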
2. Parameterizing Edge Functions: B-Splines
We need a differentiable way to parameterize the 1D edge functions $\phi(x)$. You can't learn arbitrary continuous functions without choosing a basis.
The KAN paper uses B-splines: piecewise polynomial curves defined by control points and a knot vector. The key property is locality -- adjusting one spline parameter only changes a local region of the function, which keeps optimization stable and limits catastrophic forgetting.
Each edge function $\phi(x)$ decomposes into a residual base activation (SiLU) and a learned spline:

$$\phi(x) = w_b \, \mathrm{silu}(x) + w_s \, \mathrm{spline}(x)$$
The spline is a linear combination of B-spline basis functions $B_i(x)$:

$$\mathrm{spline}(x) = \sum_i c_i B_i(x)$$
Here, $c_i$ are the learnable coefficients (the "weights" of the network), and $B_i(x)$ are the fixed polynomial basis functions evaluated at $x$.
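A minimal sketch of how such an edge function could be evaluated, using the Cox-de Boor recursion for the basis (helper names are mine; real implementations vectorize this and make $w_b$, $w_s$, and the $c_i$ trainable):

```python
import numpy as np

# Sketch of a single KAN edge function:
#   phi(x) = w_b * silu(x) + w_s * sum_i c_i * B_i(x)

def bspline_basis(i, k, t, x):
    """Value at x of the i-th B-spline basis function of degree k
    over knot vector t (Cox-de Boor recursion)."""
    if k == 0:
        return 1.0 if t[i] <= x < t[i + 1] else 0.0
    left = right = 0.0
    if t[i + k] > t[i]:
        left = (x - t[i]) / (t[i + k] - t[i]) * bspline_basis(i, k - 1, t, x)
    if t[i + k + 1] > t[i + 1]:
        right = ((t[i + k + 1] - x) / (t[i + k + 1] - t[i + 1])
                 * bspline_basis(i + 1, k - 1, t, x))
    return left + right

def silu(x):
    return x / (1.0 + np.exp(-x))

def edge_phi(x, w_b, w_s, coeffs, t, k):
    """phi(x) = w_b * silu(x) + w_s * spline(x)."""
    spline = sum(c * bspline_basis(i, k, t, x) for i, c in enumerate(coeffs))
    return w_b * silu(x) + w_s * spline

# Degree-2 basis on a uniform knot vector; 8 knots -> 8 - 2 - 1 = 5 basis fns.
t = np.linspace(0, 1, 8)
coeffs = [1.0] * 5
# With all coefficients equal to 1, the basis sums to 1 in the interior
# (partition of unity), so the pure spline part evaluates to ~1 there:
print(edge_phi(0.5, w_b=0.0, w_s=1.0, coeffs=coeffs, t=t, k=2))
```

The locality claimed above is visible in `bspline_basis`: each $B_i$ is non-zero only on the knot span $[t_i, t_{i+k+1}]$, so nudging one $c_i$ leaves the function untouched everywhere else.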
3. Why does this matter?
Moving non-linearity to the edges and parameterizing it with splines gives KANs concrete advantages over MLPs:
- Interpretability: Edge functions are 1D splines, so you can plot them directly. If the network learns a $\sin(x)$ or $x^2$ mapping, you see the curve on the edge. Good luck extracting that from a 10,000x10,000 weight matrix.
- Parameter Efficiency in Symbolic Tasks: On physics and math problems, KANs often match or beat MLP accuracy with orders of magnitude fewer parameters, because they learn the symbolic function shape directly.
- Grid Extension: You can increase B-spline grid resolution after training without retraining from scratch.
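The last point can be sketched as a least-squares refit: evaluate the trained coarse spline at sample points, then solve for coefficients on a finer knot grid that reproduce it. A simplified sketch using SciPy (names and details are mine, not the KAN paper's code):

```python
import numpy as np
from scipy.interpolate import BSpline

# Grid extension sketch: refit a trained coarse spline on a finer,
# nested knot grid by least squares, so optimization can continue at
# higher resolution without restarting from scratch.

k = 3                                          # cubic splines
rng = np.random.default_rng(0)

def clamped_knots(interior, k):
    """Repeat the end knots k extra times so the spline covers [0, 1]."""
    return np.concatenate([[interior[0]] * k, interior, [interior[-1]] * k])

tc = clamped_knots(np.linspace(0, 1, 6), k)    # coarse grid
cc = rng.normal(size=len(tc) - k - 1)          # "trained" coarse coefficients
coarse = BSpline(tc, cc, k)

tf = clamped_knots(np.linspace(0, 1, 11), k)   # finer grid (nested knots)
n_fine = len(tf) - k - 1

# Design matrix: each column is one fine basis function sampled on (0, 1).
xs = np.linspace(0.001, 0.999, 200)
B = np.column_stack([
    BSpline(tf, np.eye(n_fine)[i], k)(xs) for i in range(n_fine)
])
cf, *_ = np.linalg.lstsq(B, coarse(xs), rcond=None)

fine = BSpline(tf, cf, k)
print(np.max(np.abs(fine(xs) - coarse(xs))))   # ~0: same function, finer grid
```

Because the fine knot set contains the coarse one, the coarse spline lies exactly in the fine spline space, so the refit is essentially lossless; training then resumes with more coefficients and finer control.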