Part 1 covered the Kolmogorov-Arnold theorem. Part 2 turned it into a PyTorch module. Now we answer the obvious question: does a KAN actually outperform an MLP?
We benchmark both on symbolic regression -- a task where KANs should have a structural advantage.
1. The Experiment: Symbolic Regression
Symbolic regression means finding the mathematical expression that fits a dataset. MLPs struggle here because ReLU-based piecewise linear approximations can't efficiently capture high-frequency oscillations or multiplicative interactions without large width and depth. A ReLU network approximates curves by stitching together straight-line segments -- to trace a smooth sinusoid with reasonable fidelity, you need many neurons contributing many breakpoints, which means large hidden layers and high parameter counts.
KANs represent data with localized polynomial splines, which are a natural fit for tracing symbolic curves. Each B-spline basis function is active only within a small interval of the input domain, and the learnable coefficients control the shape of the curve locally. This means a KAN does not need to coordinate hundreds of neurons to approximate a smooth function -- a handful of spline segments can trace it directly.
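This locality is easy to see directly. The sketch below (using SciPy rather than the benchmark code) evaluates a single cubic B-spline basis function and checks that it is zero outside its knot span, so adjusting its coefficient reshapes the curve only locally:

```python
import numpy as np
from scipy.interpolate import BSpline

# One cubic B-spline basis function, supported only on the knot span [0, 4]
basis = BSpline.basis_element([0, 1, 2, 3, 4])

x = np.linspace(-2, 6, 801)
y = np.nan_to_num(basis(x, extrapolate=False))  # zero outside the support

# Nonzero only inside (0, 4): tweaking this basis function's coefficient
# reshapes the curve locally and leaves the rest of the domain untouched
print(y[np.abs(x - 2) < 0.5].min() > 0, y[(x < 0) | (x > 4)].max() == 0)
```

An MLP weight, by contrast, feeds every downstream activation, so no single parameter has this kind of bounded footprint.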
The Target Function
We fit the following target:

$$f(x) = \sin(3x) + \cos(5x)\, e^{-x^2}$$
This combines high-frequency localized oscillations (the $\cos(5x)$ term, damped by exponential decay) with a lower-frequency global wave (the $\sin(3x)$ term). The exponential decay means the high-frequency component is only significant near $x = 0$ and vanishes in the tails -- exactly the kind of localized structure that splines handle well and ReLU networks struggle to isolate without overparameterization.
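Piecing together the two components described above, the target is a few lines of NumPy (the sampling range $[-3, 3]$ is an assumption; the post does not state it):

```python
import numpy as np

def target(x):
    # Low-frequency global wave plus a high-frequency oscillation
    # damped by a Gaussian envelope centered at x = 0
    return np.sin(3 * x) + np.cos(5 * x) * np.exp(-x**2)

x = np.linspace(-3, 3, 1000)  # assumed sampling range
y = target(x)
print(target(0.0))  # 1.0: both terms contribute at the origin
```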
2. Model Architecture & Parameters
Parameters are distributed very differently in MLPs vs. KANs, so matching parameter counts matters for a fair comparison.
MLP with layers [1, 32, 32, 1]:
- $L_1$: $1 \times 32 + 32 = 64$ parameters
- $L_2$: $32 \times 32 + 32 = 1056$ parameters
- $L_3$: $32 \times 1 + 1 = 33$ parameters
Total: $\mathbf{1{,}153}$ parameters.
KAN with layers [1, 4, 1], grid size 10, spline order 3 (13 parameters per edge):
- $L_1$: $1 \times 4 \times 13 = 52$ parameters
- $L_2$: $4 \times 1 \times 13 = 52$ parameters
- Base weights: $1 \times 4 + 4 \times 1 = 8$ parameters
Total: $\mathbf{112}$ parameters -- roughly 10x fewer than the MLP.
The parameter distribution itself is telling. In the MLP, 1,056 of 1,150 parameters (92%) sit in a single 32x32 weight matrix in the second layer -- a dense block of scalar multipliers. In the KAN, every parameter is a spline coefficient that directly shapes a 1D function on an edge. There are no "dead" parameters: each one controls a visible segment of a curve you can plot and inspect.
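The arithmetic above is easy to verify in a few lines, assuming (as the counts above do) grid + order spline coefficients plus one base weight per edge:

```python
# MLP [1, 32, 32, 1]: each Linear layer has n_in*n_out weights + n_out biases
mlp_layers = [1, 32, 32, 1]
mlp_params = sum(n_in * n_out + n_out
                 for n_in, n_out in zip(mlp_layers, mlp_layers[1:]))
print(mlp_params)  # 1153

# KAN [1, 4, 1]: (grid + order) spline coefficients + one base weight per edge
grid, order = 10, 3
kan_layers = [1, 4, 1]
edges = sum(n_in * n_out for n_in, n_out in zip(kan_layers, kan_layers[1:]))
kan_params = edges * (grid + order + 1)
print(kan_params)  # 112
```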
3. Results
Both networks were trained with Adam (lr $= 0.001$) and MSE loss for 10,000 epochs.
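That setup amounts to a standard full-batch PyTorch loop. A minimal sketch, with the `nn.Sequential` MLP standing in where the KAN from Part 2 would be swapped in (the data range, seed, and the shortened epoch count are assumptions for a quick run):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.linspace(-3, 3, 1000).unsqueeze(1)  # assumed training range
y = torch.sin(3 * x) + torch.cos(5 * x) * torch.exp(-x ** 2)

model = nn.Sequential(nn.Linear(1, 32), nn.ReLU(),
                      nn.Linear(32, 32), nn.ReLU(),
                      nn.Linear(32, 1))  # the KAN from Part 2 drops in here
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

initial_loss = loss_fn(model(x), y).item()
for epoch in range(2_000):  # truncated from the benchmark's 10,000 epochs
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
print(initial_loss, loss.item())  # the loss falls well below its starting value
```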
- MLP: Loss plateaued around $0.0057$ MSE. The ReLU activations produced jagged approximations of the smooth target and couldn't resolve the high-frequency peaks. Around the origin, where the $\cos(5x) \cdot \exp(-x^2)$ term oscillates rapidly, the MLP's piecewise linear segments were too coarse to follow the curve faithfully.
- KAN: Converged to $\approx 0.00015$ MSE. The B-spline edge functions warped to match the symbolic curve shape directly, locking onto local features that the MLP missed entirely. Each of the 4 hidden-layer splines specialized to capture a different component of the target -- some tracing the $\sin(3x)$ backbone, others homing in on the damped high-frequency oscillation near the origin.
With 10x fewer parameters, the KAN achieved roughly 38x lower test loss. This is not a marginal improvement -- it is a qualitative difference between "approximately right" and "functionally exact."
4. Takeaways
For scientific computing -- PDEs, physical system modeling, symbolic tasks -- KANs offer a clear win in parameter efficiency and interpretability over MLPs. The interpretability angle is particularly valuable: because each edge function is a 1D spline, you can literally plot it and inspect what the network has learned. If the target contains a $\sin$ component, you will see a sinusoidal curve on the corresponding edge. Try extracting that kind of insight from a 32x32 weight matrix.
KANs also support grid extension -- after training on a coarse grid (say 10 knots), you can increase the grid resolution to 20 or 50 knots without retraining from scratch. The existing spline coefficients are interpolated onto the finer grid, providing a warm start for higher-fidelity refinement. This is analogous to adaptive mesh refinement in numerical PDE solvers, and has no natural counterpart in MLP training.
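A toy version of the idea, using SciPy splines rather than the KAN module itself: sample the coarse-grid spline and least-squares fit a spline on a finer knot vector. Because the fine knots below are a superset of the coarse ones, the fit reproduces the coarse curve essentially exactly -- the warm start described above. The knot placements are illustrative assumptions, not the benchmark's exact grids:

```python
import numpy as np
from scipy.interpolate import BSpline, make_lsq_spline

k = 3  # cubic, matching the spline order used in the benchmark
rng = np.random.default_rng(0)

# Coarse "grid 10" spline on [-1, 1]: clamped ends + 9 interior knots
# (17 knots -> 13 coefficients, matching the 13 parameters per edge)
t_coarse = np.r_[[-1.0] * (k + 1), np.linspace(-0.8, 0.8, 9), [1.0] * (k + 1)]
coarse = BSpline(t_coarse, rng.normal(size=len(t_coarse) - k - 1), k)

# Fine "grid 20" knot vector: every coarse knot plus the midpoints
t_fine = np.r_[[-1.0] * (k + 1), np.linspace(-0.9, 0.9, 19), [1.0] * (k + 1)]

# "Grid extension": sample the coarse spline and least-squares fit on the
# finer knots -- the new coefficients start where the old spline left off
x = np.linspace(-1, 1, 400)
fine = make_lsq_spline(x, coarse(x), t_fine, k)
print(np.max(np.abs(fine(x) - coarse(x))))  # ~0: the refinement is a warm start
```

The real pykan implementation works per edge and re-fits against the spline's own outputs, but the mechanism is the same nested-refinement idea.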
The main limitation is throughput. B-spline evaluation is slower on current hardware than dense matrix multiplies, so KANs are not yet competitive for billion-parameter language models. The core bottleneck is that spline evaluation involves per-edge lookups and recursion steps that do not map as cleanly to GPU warp-level parallelism as a single large GEMM call. That gap will narrow as CUDA kernels for spline operations improve, but for now KANs are best suited to small-to-medium scientific models where parameter efficiency and interpretability matter more than raw throughput.