Deconstructing the NeRF Kernel: Part 1 — Why MLPs Need Fourier Features to See

Overview

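NeRF is a coordinate MLP that learned to render. (It is not a SIREN: SIREN uses sine activations, whereas NeRF uses a ReLU MLP on encoded inputs.) The geometry is not in the architecture; it is in the loss. The ingredient that makes this work, and that many later neural-field methods inherit in some form, is a single trick: Fourier feature encoding of the input coordinates.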

The Spectral Bias of ReLU MLPs

Tancik et al. (2020) used Neural Tangent Kernel (NTK) theory to show that the kernel of a standard ReLU MLP has eigenvalues that fall off rapidly with frequency. High-frequency components of the target function therefore converge much more slowly than low-frequency ones, so a ReLU MLP fed raw coordinates struggles to fit sharp edges, fine textures, or periodic structure within a realistic training budget. This spectral bias is what held back coordinate-based scene representations before NeRF.
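
To make the convergence claim concrete, here is a sketch of the standard kernel-regression argument. It assumes the network trains in the NTK regime: very wide layers, gradient flow with a small learning rate $\eta$, and an output close to zero at initialization. The symbols $\mathbf{K}$, $\mathbf{Q}$, $\boldsymbol{\Lambda}$, $\lambda_i$, and $\eta$ are notation introduced here for illustration. Write the NTK Gram matrix on the training points as $\mathbf{K} = \mathbf{Q} \boldsymbol{\Lambda} \mathbf{Q}^\top$ with eigenvalues $\lambda_i$. The training residual after time $t$ is then approximately

$$\mathbf{Q}^\top \big( \hat{\mathbf{y}}^{(t)} - \mathbf{y} \big) \approx -e^{-\eta \boldsymbol{\Lambda} t} \, \mathbf{Q}^\top \mathbf{y}, \qquad \text{so component } i \text{ of the error decays as } e^{-\eta \lambda_i t}.$$

The time needed to shrink component $i$ by a factor of $e$ is $1 / (\eta \lambda_i)$. If the eigenvectors that carry high frequencies have eigenvalues a thousand times smaller than the low-frequency ones, they need roughly a thousand times more training to reach the same relative error.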

Fourier Feature Encoding

The fix: lift the coordinate $\mathbf{p}$ into a higher-dimensional Fourier space:

$$\gamma(\mathbf{p}) = \big[ \sin(2^0 \pi \mathbf{p}),\, \cos(2^0 \pi \mathbf{p}),\, \sin(2^1 \pi \mathbf{p}),\, \cos(2^1 \pi \mathbf{p}),\, \ldots,\, \sin(2^{L-1} \pi \mathbf{p}),\, \cos(2^{L-1} \pi \mathbf{p}) \big].$$

The MLP now operates on $\gamma(\mathbf{p}) \in \mathbb{R}^{2 L \, \dim(\mathbf{p})}$ instead of $\mathbf{p}$. It no longer has to manufacture high frequencies; they are already present in the input.
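
The encoding can be sketched in a few lines of NumPy. This is a minimal illustration that follows the formula for $\gamma$ term by term; the function name `fourier_encode` and the parameter `num_freqs` (standing in for $L$) are names chosen here, not taken from any particular codebase.

```python
import numpy as np

def fourier_encode(p, num_freqs):
    """Map coordinates of shape (N, D) to Fourier features of shape (N, 2 * num_freqs * D).

    Feature order follows gamma(p): for each frequency 2^k * pi,
    first sin over all D coordinates, then cos over all D coordinates.
    """
    p = np.atleast_2d(np.asarray(p, dtype=np.float64))          # (N, D)
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi               # (L,) angular frequencies
    angles = p[:, None, :] * freqs[None, :, None]               # (N, L, D)
    feats = np.stack([np.sin(angles), np.cos(angles)], axis=2)  # (N, L, 2, D)
    return feats.reshape(p.shape[0], -1)                        # (N, 2 * L * D)

# Usage: one 3D point, four frequencies -> 2 * 4 * 3 = 24 features.
point = np.array([[0.25, -0.5, 1.0]])
encoded = fourier_encode(point, num_freqs=4)
print(encoded.shape)  # (1, 24)
```

Some implementations also concatenate the raw coordinate $\mathbf{p}$ to the encoded vector; the sketch above omits that so the output dimension matches $2 L \, \dim(\mathbf{p})$ exactly.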

Why Exponentially-Spaced Frequencies?

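Natural signals have approximately self-similar spectra, so detail is spread across many octaves. Linearly spaced frequencies spend most of their bins inside the top octave or two of their range. Exponential spacing spans $L - 1$ octaves with only $L$ bins. With angular frequencies $2^k \pi$, $L = 6$ covers $0.5, 1, 2, 4, 8, 16$ cycles per unit input, which is $1, 2, 4, 8, 16, 32$ cycles across a domain of width 2 such as $[-1, 1]$.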
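
The cycle counts depend on how the input is normalized, which is easy to check numerically. The sketch below converts the angular frequencies $2^k \pi$ to cycles per unit input and to cycles across a width-2 domain such as $[-1, 1]$ (an assumed normalization), then compares the span with a linearly spaced bank of the same size. The variable names are illustrative.

```python
import numpy as np

num_freqs = 6
k = np.arange(num_freqs)

angular = (2.0 ** k) * np.pi               # radians per unit input, as in gamma(p)
cycles_per_unit = angular / (2.0 * np.pi)  # one cycle is 2*pi radians
cycles_over_domain = 2.0 * cycles_per_unit # assumed domain [-1, 1] has width 2

# Linear spacing with the same lowest frequency and the same number of bins.
linear_cycles = 0.5 * np.arange(1, num_freqs + 1)

octaves_exponential = np.log2(cycles_per_unit[-1] / cycles_per_unit[0])
octaves_linear = np.log2(linear_cycles[-1] / linear_cycles[0])

print(cycles_per_unit.tolist())     # [0.5, 1.0, 2.0, 4.0, 8.0, 16.0]
print(cycles_over_domain.tolist())  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
print(octaves_exponential, octaves_linear)
```

With six bins, exponential spacing spans five octaves, while linear spacing spans about 2.6 and places half of its bins in the top octave.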

Why Both Sin AND Cos?

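$\sin(\omega p)$ and $\cos(\omega p)$ together span every phase at frequency $\omega$, because $a \sin(\omega p + \phi) = a \cos\phi \, \sin(\omega p) + a \sin\phi \, \cos(\omega p)$. With only $\sin$, the first linear layer could do no more than rescale a sinusoid of fixed phase, and the encoder cannot compensate because it has no trainable parameters.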
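
A quick numerical check of the phase argument: fit a phase-shifted sinusoid with a single linear layer (here, ordinary least squares) on top of either both features or the sine feature alone. The frequency, phase, and sample grid below are arbitrary choices made for this illustration.

```python
import numpy as np

omega = 4.0 * np.pi  # one fixed encoder frequency (arbitrary)
phi = 0.7            # phase of the target, unknown to the encoder (arbitrary)
p = np.linspace(-1.0, 1.0, 200)
target = np.sin(omega * p + phi)

sin_feat = np.sin(omega * p)
cos_feat = np.cos(omega * p)

# Linear fit on [sin, cos]: recovers the target exactly, with weights (cos(phi), sin(phi)).
both = np.stack([sin_feat, cos_feat], axis=1)
coef_both, *_ = np.linalg.lstsq(both, target, rcond=None)
err_both = np.sqrt(np.mean((both @ coef_both - target) ** 2))

# Linear fit on sin alone: the cosine part of the target cannot be represented.
only_sin = sin_feat[:, None]
coef_sin, *_ = np.linalg.lstsq(only_sin, target, rcond=None)
err_sin = np.sqrt(np.mean((only_sin @ coef_sin - target) ** 2))

print(coef_both, err_both, err_sin)
```

The two-feature fit has an RMS error at floating-point precision, while the sine-only fit leaves a residual close to $|\sin\phi| / \sqrt{2}$, the part of the target carried by $\cos(\omega p)$.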

NeRF in Three Equations

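Let $\mathbf{r}(t) = \mathbf{o} + t \mathbf{d}$ be a camera ray. Sample $N$ points $t_1 < \dots < t_N$ along it and predict a color and a density at each one:

$$(\mathbf{c}_i, \sigma_i) = f_\theta\!\big( \gamma_x(\mathbf{r}(t_i)),\; \gamma_d(\mathbf{d}) \big).$$

Here $\gamma_x$ and $\gamma_d$ are the Fourier encodings of position and view direction. In the original NeRF, $\sigma_i$ depends on position only; the view direction feeds only the color branch.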

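Volume render with the alpha-compositing quadrature, a discrete approximation of the volume rendering integral, where $\delta_i = t_{i+1} - t_i$ is the spacing between adjacent samples:

$$C(\mathbf{r}) = \sum_{i=1}^{N} T_i \, (1 - \exp(-\sigma_i \, \delta_i)) \, \mathbf{c}_i, \quad T_i = \exp\!\left( -\sum_{j < i} \sigma_j \, \delta_j \right).$$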

The loss is the per-pixel mean squared error between $C(\mathbf{r})$ and the ground-truth pixel color. The whole pipeline is differentiable, so it trains end to end.
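
The three steps above can be sketched for a single ray in NumPy. This is an illustration under stated assumptions, not the reference implementation: samples are evenly spaced rather than stratified, and `toy_field` is a hypothetical hand-written stand-in for the trained network $f_\theta$ (a red ball of radius 0.5 at the origin). All function names are chosen here.

```python
import numpy as np

def sample_ray(origin, direction, near, far, num_samples):
    """Evenly spaced sample points on r(t) = o + t d, plus spacings delta_i."""
    edges = np.linspace(near, far, num_samples + 1)
    t = edges[:-1]
    delta = np.diff(edges)  # delta_i = t_{i+1} - t_i
    points = origin[None, :] + t[:, None] * direction[None, :]
    return points, delta

def toy_field(points):
    """Hypothetical stand-in for f_theta: dense red inside a ball, empty outside."""
    inside = np.linalg.norm(points, axis=1) < 0.5
    sigma = np.where(inside, 50.0, 0.0)
    color = np.tile(np.array([1.0, 0.0, 0.0]), (points.shape[0], 1))
    return color, sigma

def composite(sigma, color, delta):
    """C = sum_i T_i (1 - exp(-sigma_i delta_i)) c_i."""
    tau = sigma * delta                  # optical depth of each segment
    alpha = 1.0 - np.exp(-tau)           # opacity of each segment
    T = np.exp(-(np.cumsum(tau) - tau))  # T_i uses only segments j < i
    weights = T * alpha
    return weights @ color, weights

origin = np.array([0.0, 0.0, -2.0])
direction = np.array([0.0, 0.0, 1.0])
points, delta = sample_ray(origin, direction, near=0.0, far=4.0, num_samples=64)
color, sigma = toy_field(points)
pixel, weights = composite(sigma, color, delta)

target = np.array([1.0, 0.0, 0.0])       # ground-truth pixel color for this ray
loss = np.mean((pixel - target) ** 2)    # per-pixel MSE
print(pixel, loss)
```

A useful sanity check is that the weights telescope: their sum equals $1 - \exp(-\sum_i \sigma_i \delta_i)$, so a ray through empty space contributes nothing and a ray through an opaque object has weights that sum to nearly 1.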

What This Series Demonstrates

Implementing full 3D NeRF (multi-view poses, ray sampling, volume rendering) takes roughly 500-800 lines of code and hours of GPU training. What we demonstrate instead is the kernel: a coordinate MLP plus Fourier features, fitting a high-frequency 2D image. This captures the same numerical phenomenon (spectral bias defeated by Fourier features) at a scale that trains in 6 seconds.