
Deconstructing Kolmogorov-Arnold Networks (KANs)

Part 1: The Mathematics of Splines on Edges

Introduction

For decades, the standard Multi-Layer Perceptron (MLP) has been the undisputed foundational building block of deep learning. The architecture is taught in week one of any machine learning course: take an input vector, multiply it by a dense weight matrix, sum the results at each node, and pass that scalar through a fixed, non-linear activation function like ReLU, Sigmoid, or Tanh.

We place the linear operations on the edges (the weights) and the non-linear operations on the nodes (the activation functions). But what if we inverted this paradigm? What if the nodes simply summed the inputs, and the non-linear activation functions lived directly on the edges?

This is the core premise of Kolmogorov-Arnold Networks (KANs). In this 3-part series, we will completely deconstruct KANs, moving from pure mathematical theory to a functional PyTorch implementation, and finally, a benchmark comparing them against traditional MLPs.

In Part 1, we will explore the foundational mathematics that makes KANs possible: The Kolmogorov-Arnold Representation Theorem and the mechanics of B-Splines.

1. The Kolmogorov-Arnold Representation Theorem

The theoretical foundation of KANs rests on a theorem established by Andrey Kolmogorov and Vladimir Arnold in 1957. The theorem states, rather astonishingly, that any multivariate continuous function can be represented as a finite composition of continuous functions of a single variable and the operation of addition.

Mathematically, for a continuous function $f : [0,1]^n \to \mathbb{R}$, the theorem is expressed as:

$$ f(x_1, \dots, x_n) = \sum_{q=0}^{2n} \Phi_q \left( \sum_{p=1}^{n} \phi_{q,p}(x_p) \right) $$

Where:

- $\phi_{q,p} : [0,1] \to \mathbb{R}$ are the inner univariate functions, each applied to a single input $x_p$;
- $\Phi_q : \mathbb{R} \to \mathbb{R}$ are the outer univariate functions;
- the outer sum runs over $2n + 1$ terms, where $n$ is the number of input variables.

Implications for Neural Architecture

MLPs achieve universal approximation by relying on depth and width (a large number of nodes) with fixed non-linearities. The Kolmogorov-Arnold theorem, however, suggests that we do not need complex multivariate mappings at all: univariate (1D) functions and summation suffice.
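To make this concrete, consider multiplication of two variables. The identity $xy = \frac{(x+y)^2 - (x-y)^2}{4}$ expresses a genuinely multivariate function using only addition and univariate squaring. This is an illustrative decomposition in the spirit of the theorem, not its canonical construction:

```python
def f(x, y):
    # the multivariate target function
    return x * y

def f_ka(x, y):
    # outer functions: Phi_1(u) = u**2 / 4 and Phi_2(u) = -u**2 / 4
    # inner sums:      x + y          and      x + (-y)
    return (x + y) ** 2 / 4 - (x - y) ** 2 / 4

# the composition of univariate functions reproduces the product
assert f_ka(3.0, 5.0) == f(3.0, 5.0) == 15.0
```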

In a KAN layer architecture, instead of a matrix $W \in \mathbb{R}^{n_{out} \times n_{in}}$, we have a grid of 1D functions $\phi_{i,j}$, where each function connects input $i$ to output $j$.

The output $y_j$ of a KAN node is simply the sum of these non-linear edge functions applied to the respective inputs:

$$ y_j = \sum_{i=1}^{n_{in}} \phi_{i,j}(x_i) $$

The network learns the functions themselves, rather than just scalar weights.
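The summation above is easy to sketch in PyTorch. In this minimal example, `torch.sin` and `torch.tanh` are stand-ins for the learned edge functions $\phi_{i,j}$, purely for illustration:

```python
import torch

def kan_node(x, edge_fns):
    # y_j = sum_i phi_{i,j}(x_i): each edge applies its own
    # univariate function before the node sums the results
    return sum(phi(x_i) for phi, x_i in zip(edge_fns, x))

x = torch.tensor([0.5, -1.0])
# stand-ins for the learned edge functions phi_{1,j}, phi_{2,j}
edges = [torch.sin, torch.tanh]
y = kan_node(x, edges)   # sin(0.5) + tanh(-1.0)
```

Note that the node itself contains no parameters: all learning happens in the edge functions.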

2. Parameterizing Edge Functions: B-Splines

To make KANs practical in deep learning, we need a differentiable, expressive way to parameterize these 1D edge functions $\phi(x)$. We cannot simply learn arbitrary continuous functions without a basis.

The authors of KAN propose using Basis Splines (B-splines). A B-spline is a piecewise polynomial curve defined by a set of control points and a knot vector. This provides localized control: adjusting one parameter of the spline only affects a local region of the function, which helps mitigate catastrophic forgetting and stabilizes optimization.
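This locality claim can be verified numerically. The sketch below uses SciPy's `BSpline` as a reference implementation: perturbing a single control coefficient changes the curve only on the support of the corresponding basis function, a window of $k + 2$ consecutive knots:

```python
import numpy as np
from scipy.interpolate import BSpline

k = 3                                   # cubic spline
t = np.linspace(-2.0, 2.0, 12)          # uniform knot vector
c0 = np.zeros(len(t) - k - 1)           # all-zero control coefficients
c1 = c0.copy()
c1[4] = 1.0                             # perturb a single coefficient

# evaluate both curves inside the spline's base interval [t[k], t[len(c0)]]
xs = np.linspace(t[k], t[len(c0)], 200)
diff = BSpline(t, c1, k)(xs) - BSpline(t, c0, k)(xs)

# the change is confined to the support of B_4: [t[4], t[4 + k + 1]]
outside = (xs < t[4]) | (xs > t[4 + k + 1])
assert np.allclose(diff[outside], 0.0)
```

A dense weight in an MLP has no analogous property: changing it shifts the output for every input.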

Each edge function $\phi(x)$ is decomposed into a residual base activation (such as SiLU) and a parameterized spline:

$$ \phi(x) = w_b \cdot \text{SiLU}(x) + w_s \cdot \text{Spline}(x) $$
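This decomposition is a one-liner in PyTorch. The sketch below assumes the spline part is already available as a callable (here a placeholder `lambda`, for illustration only); `w_b` and `w_s` are the learnable scalars from the formula:

```python
import torch
import torch.nn.functional as F

def edge_phi(x, w_b, w_s, spline_fn):
    # phi(x) = w_b * SiLU(x) + w_s * Spline(x)
    return w_b * F.silu(x) + w_s * spline_fn(x)

x = torch.linspace(-1.0, 1.0, 5)
# with w_s = 0 the edge reduces to the scaled base activation
out = edge_phi(x, w_b=2.0, w_s=0.0, spline_fn=lambda t: t ** 2)
```

The SiLU term acts as a residual path: even when the spline coefficients are near zero early in training, gradients still flow through the edge.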

The spline itself is a linear combination of B-spline basis functions $B_i(x)$:

$$ \text{Spline}(x) = \sum_{i=1}^{N} c_i B_i(x) $$

Here, $c_i$ are the learnable coefficients (the "weights" of the network), $N$ is the number of basis functions (determined by the grid size and the spline degree), and $B_i(x)$ are the fixed piecewise-polynomial basis functions evaluated at $x$.
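As a sketch of how this sum can be computed, here is a minimal NumPy implementation of the Cox-de Boor recursion, a standard way to evaluate B-spline bases (the actual KAN implementation vectorizes this in PyTorch; function names here are our own):

```python
import numpy as np

def bspline_basis(x, knots, k):
    """All degree-k B-spline basis functions evaluated at points x
    via the Cox-de Boor recursion. Shape: (n_basis, len(x))."""
    # degree 0: indicator of each half-open knot interval
    B = [((knots[i] <= x) & (x < knots[i + 1])).astype(float)
         for i in range(len(knots) - 1)]
    for d in range(1, k + 1):
        B_next = []
        for i in range(len(B) - 1):
            left = np.zeros_like(x)
            if knots[i + d] > knots[i]:
                left = (x - knots[i]) / (knots[i + d] - knots[i]) * B[i]
            right = np.zeros_like(x)
            if knots[i + d + 1] > knots[i + 1]:
                right = ((knots[i + d + 1] - x)
                         / (knots[i + d + 1] - knots[i + 1]) * B[i + 1])
            B_next.append(left + right)
        B = B_next
    return np.stack(B)

def spline_eval(x, coeffs, knots, k):
    # Spline(x) = sum_i c_i * B_i(x)
    return coeffs @ bspline_basis(x, knots, k)

k = 3
knots = np.linspace(-2.0, 2.0, 9)        # 9 knots -> 9 - 3 - 1 = 5 cubic basis functions
x = np.array([-0.25, 0.0, 0.25])         # points well inside the grid
coeffs = np.ones(len(knots) - k - 1)

values = spline_eval(x, coeffs, knots, k)  # all-ones coeffs -> partition of unity
```

With all coefficients set to 1, the spline evaluates to exactly 1 inside the grid, a consequence of the partition-of-unity property of B-spline bases and a convenient sanity check for any implementation.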

3. Why does this matter?

By pushing the non-linearity onto the edges and parameterizing it with splines, KANs offer several notable advantages over MLPs, as argued in the original KAN paper:

- Parameter efficiency: on smooth, compositional target functions, learnable 1D splines can match a given accuracy with fewer parameters than stacks of fixed non-linearities.
- Interpretability: each learned edge function is a 1D curve that can be plotted, inspected, and sometimes matched to a symbolic formula.
- Local plasticity: because splines have local support, updating one region of an edge function leaves the rest of it untouched.

Next Steps: Building it in PyTorch

The math is beautiful, but the true test is translating formulas into tensor operations. In Part 2 of this series, we will drop the theory and open up an IDE. We will construct the B-spline basis evaluations and build a 1D KAN layer in pure PyTorch in under 100 lines of code.