Deconstructing NeRF Kernel: Part 3 - 10.64 dB From One Trick

The Experiment

Two MLPs. Same architecture ($4$ layers, hidden dim $128$, ReLU + sigmoid). Same training data (a $64 \times 64$ image). Same training budget ($2{,}000$ iterations of AdamW at $\eta = 5 \times 10^{-4}$). Same initialisation seed. The only difference: one of them encodes its input coordinates through sin/cos at $6$ exponentially-spaced frequencies before the MLP sees them.
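To make the one differing ingredient concrete, here is a minimal sketch of such an encoder in dependency-free Python. The function name `fourier_encode`, the feature ordering, and the assumption that coordinates are normalised to $[0, 1]$ are illustrative choices, not necessarily what the repository does; the frequencies $2^k \pi$ for $k = 0, \ldots, 5$ are the ones listed later in this post.

```python
import math

def fourier_encode(x, y, num_freqs=6):
    """Encode a 2D coordinate as sin/cos features at frequencies 2^k * pi.

    Hypothetical helper: name, ordering and input range are assumptions.
    """
    features = []
    for k in range(num_freqs):
        freq = (2.0 ** k) * math.pi  # pi, 2*pi, 4*pi, ..., 32*pi
        for coord in (x, y):
            features.append(math.sin(freq * coord))
            features.append(math.cos(freq * coord))
    return features

encoded = fourier_encode(0.25, 0.75)
print(len(encoded))  # 2 coords * 6 freqs * 2 functions = 24
```

The output width of 24 is what the first linear layer of the encoded model consumes in place of the raw 2D coordinate.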

The PSNR difference is $10.64$ dB. The visual difference is qualitative: the unencoded MLP produces a smooth blob; the encoded MLP produces a recognisable image. This single design choice — change the input representation, not the architecture — is what made NeRF possible.

The Image We Fit

The target is a synthetic $64 \times 64$ RGB image designed to be hard for low-frequency-biased models. It contains four distinct features:

An $8 \times 8$ black-and-white checkerboard occupying the top half (high spatial frequency, regular structure).
A solid red circle in the bottom-left (smooth interior, sharp curved boundary).
A solid blue square in the bottom-right (smooth interior, sharp straight boundaries).
A 2-pixel-thick yellow diagonal line cutting across the bottom half (the highest-frequency feature in the image).

These four features span a spectrum of difficulty. Solid regions are easy for any MLP (low spatial frequency, smooth gradients). Sharp shape boundaries become representable with Fourier features (mid-to-high frequency). The diagonal yellow line tests the limit — it requires both high frequency and oriented structure, the kind of detail that pre-NeRF coordinate MLPs simply could not produce.
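For readers who want to reproduce a similar target, the sketch below builds an image with the same four features using plain nested lists. The exact geometry is not given in this post, so every placement here is an assumption: 8-pixel checkerboard cells, the circle's centre and radius, the square's extent, the white background, and a line whose 2-pixel thickness is measured vertically. The function name `make_target` is hypothetical.

```python
def make_target(size=64):
    """Return a size x size image as nested lists of (r, g, b) tuples in [0, 1]."""
    half = size // 2
    cell = size // 8                                      # ASSUMPTION: 8-pixel cells
    cx, cy, radius = size // 4, 3 * size // 4, size // 6  # ASSUMPTION: circle placement
    sq_lo, sq_hi = 5 * size // 8, 7 * size // 8           # ASSUMPTION: square extent
    image = []
    for row in range(size):
        line = []
        for col in range(size):
            if row < half:
                # Top half: black-and-white checkerboard
                v = float((row // cell + col // cell) % 2)
                pixel = (v, v, v)
            elif 0 <= (row - half) - col // 2 < 2:
                # 2-pixel-thick yellow diagonal across the bottom half
                pixel = (1.0, 1.0, 0.0)
            elif (col - cx) ** 2 + (row - cy) ** 2 <= radius ** 2:
                pixel = (1.0, 0.0, 0.0)                   # red circle, bottom-left
            elif sq_lo <= col < sq_hi and sq_lo <= row < sq_hi:
                pixel = (0.0, 0.0, 1.0)                   # blue square, bottom-right
            else:
                pixel = (1.0, 1.0, 1.0)                   # ASSUMPTION: white background
            line.append(pixel)
        image.append(line)
    return image

target = make_target()
print(len(target), len(target[0]))  # 64 64
```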

Results

Metric	No PE	With PE ($L = 6$)
Parameters	$33{,}795$	$36{,}611$
Final MSE	$0.0771$	$0.0067$
Final PSNR	$11.13$ dB	$21.77$ dB
PSNR improvement	—	$+10.64$ dB
MSE ratio	—	$11.6\times$ lower
Training time	$6.7$ s	$6.1$ s

The hidden and output layers are identical in both models; the unencoded baseline has $33{,}795$ parameters in total. The encoded variant adds $2{,}816$ weights to the first linear layer, which goes from $2 \to 128$ to $24 \to 128$ (and $22 \times 128 = 2{,}816$), an $8\%$ parameter increase. The training time is essentially identical: the encoder has no parameters and costs only a few elementwise operations, so the per-iteration cost is dominated by the rest of the MLP, which is unchanged.
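The parameter counts in the table can be checked by hand. The sketch below assumes that "4 layers" means four linear layers with biases and that the output has 3 channels (RGB); the helper name `mlp_param_count` is made up for illustration. Under those assumptions the arithmetic reproduces both table entries.

```python
def mlp_param_count(in_dim, hidden=128, out_dim=3, num_layers=4):
    """Count weights and biases of a fully-connected MLP (assumed layout)."""
    dims = [in_dim] + [hidden] * (num_layers - 1) + [out_dim]
    # Each linear layer has d_in * d_out weights plus d_out biases.
    return sum(d_in * d_out + d_out for d_in, d_out in zip(dims[:-1], dims[1:]))

no_pe = mlp_param_count(2)     # raw (x, y) input
with_pe = mlp_param_count(24)  # 2 coords * 6 freqs * (sin, cos)
print(no_pe, with_pe, with_pe - no_pe)  # 33795 36611 2816
```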

A $10.64$ dB PSNR improvement corresponds to an $11.6\times$ reduction in MSE, since $10^{10.64/10} \approx 11.6$. This is not a "marginal improvement": an order-of-magnitude drop in error is the difference between an unrecognisable reconstruction and a recognisable one. The numerical metric is what we can report in a table; the visual difference is what makes the encoding indispensable in practice.
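The conversion between MSE, PSNR and dB differences is short enough to write out. This sketch assumes pixel values in $[0, 1]$, so the peak signal is $1$; the function names are illustrative.

```python
import math

def psnr(mse, peak=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(peak ** 2 / mse)

def mse_ratio_from_db(delta_db):
    """Factor by which MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (delta_db / 10.0)

print(round(psnr(0.0771), 2))              # 11.13
print(round(mse_ratio_from_db(10.64), 1))  # 11.6
```

Feeding the rounded table value $0.0067$ into `psnr` gives about $21.74$ dB rather than $21.77$ dB; that small gap is what you would expect if the reported MSE was rounded to four decimals.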

The Training Trajectories

The training curves tell a story the final numbers don't. The Fourier-encoded model reaches the unencoded model's final PSNR of $11.13$ dB after fewer than $100$ iterations — $20\times$ fewer training steps than the unencoded variant needs to reach the same point. The encoded model then continues to improve, plateauing around iteration $1400$ at $\sim 22$ dB.

The unencoded model is still slowly descending at iteration $2000$, but its trajectory suggests a plateau in the low teens of dB, not enough to recover the sharp features of the image. Within any realistic training budget, the unencoded MLP would not reach the encoded MLP's quality. The bottleneck is not the particular budget we chose; it is how slowly gradient descent fits high-frequency content from raw coordinates.

This is what spectral bias means in practice. A ReLU MLP is not refusing to fit high frequencies, and in principle it can represent them; gradient descent simply gets there far too slowly. The fitting time grows rapidly with frequency, so the high-frequency components have effectively not converged within any reasonable training budget.

Visual Reconstruction

The qualitative reconstructions confirm the numerical result.

No positional encoding. The reconstruction is a smooth gradient of muted colors. The checkerboard region appears as a uniform gray patch: the alternating black and white squares blur into their pixel-wise average. The red circle is a fuzzy reddish blob with no defined edge. The blue square is a fuzzy bluish blob with no corners. The yellow diagonal line has disappeared entirely; its frequency is too high for the unencoded MLP to fit within this training budget.

With Fourier features. The reconstruction is recognisable. The individual checkerboard cells are visible, though slightly soft at the edges. The circle has a crisp curved boundary; the square has corners. The yellow diagonal line is present as a continuous yellow streak. None of this is photographically perfect — the MLP is small and the encoding has limited frequency range — but the image is unambiguously the target image.

This is the qualitative difference that $10.64$ dB makes. Anyone looking at the two reconstructions side by side would describe the unencoded one as "broken" and the encoded one as "OK, with some softening at edges". The transition between those two qualitative regimes is what the entire neural-field literature is about.

Why Fourier Features Fix Spectral Bias

The ReLU MLP's spectral bias is not just an empirical observation; it has a theoretical explanation. Tancik et al. (2020) showed, using Neural Tangent Kernel (NTK) theory, that the eigenvalue spectrum of a ReLU MLP's NTK falls off rapidly at high frequencies. The NTK essentially describes how gradient descent moves the function during training; the falloff means that high-frequency components of the target function converge slowly, while low-frequency components converge quickly. For the same total training time, the error on the low frequencies is nearly gone while the error on the high frequencies remains large.
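One compact way to state this, following the standard NTK analysis of gradient descent (a sketch that assumes a wide network, a small learning rate $\eta$, and a network output near zero at initialisation; the symbols $\mathbf{K}$, $\mathbf{q}_i$ and $\lambda_i$ are introduced here for illustration): write the NTK matrix on the training points as $\mathbf{K} = \mathbf{Q} \boldsymbol{\Lambda} \mathbf{Q}^\top$, with eigenvectors $\mathbf{q}_i$ and eigenvalues $\lambda_i$. After $t$ steps, the residual along each eigenvector shrinks roughly as

$$\left| \mathbf{q}_i^\top \big( \hat{\mathbf{y}}^{(t)} - \mathbf{y} \big) \right| \approx e^{-\eta \lambda_i t} \, \left| \mathbf{q}_i^\top \mathbf{y} \right|$$

so the number of steps needed to reduce component $i$ by a factor $\epsilon$ is about $t_i \approx \ln(1/\epsilon) / (\eta \lambda_i)$. Components with small $\lambda_i$, which for a ReLU MLP on raw coordinates are the high-frequency ones, take correspondingly longer.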

Fourier feature encoding changes the input representation in a way that flattens the effective kernel across frequencies. After encoding, the input contains the high-frequency information explicitly. The MLP no longer has to manufacture high frequencies (which is what its spectral bias prevents); it has to linearly combine high-frequency features that are already present in the input.

Tancik et al.'s key result: the NTK of the encoded model is stationary (it depends only on the difference between two input points), with a bandwidth set by the chosen frequencies. A well-chosen bandwidth is what lets gradient descent fit high frequencies at a useful rate. The empirical $10.64$ dB improvement is a direct consequence of this kernel-level change.
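The stationarity comes from a simple identity: the inner product of two sin/cos encodings depends only on the offset between the inputs, because $\sin a \sin b + \cos a \cos b = \cos(a - b)$. The sketch below checks this numerically in 1D for the six frequencies used in this post. It illustrates the kernel of the encoding itself, not the full NTK of the trained network, and the helper names are hypothetical.

```python
import math

FREQS = [(2.0 ** k) * math.pi for k in range(6)]  # pi ... 32*pi

def encode_1d(u):
    """sin/cos features of a scalar coordinate at each frequency."""
    return [f(w * u) for w in FREQS for f in (math.sin, math.cos)]

def feature_kernel(u, v):
    """Inner product of the two encodings."""
    return sum(a * b for a, b in zip(encode_1d(u), encode_1d(v)))

k1 = feature_kernel(0.10, 0.25)
k2 = feature_kernel(0.60, 0.75)  # same offset of 0.15, shifted by 0.5
closed_form = sum(math.cos(w * 0.15) for w in FREQS)
print(abs(k1 - k2) < 1e-9, abs(k1 - closed_form) < 1e-9)  # True True
```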

Why Exponential Frequencies Specifically

The choice of frequency spacing matters. Several alternatives one might consider:

Linear spacing. Use frequencies $\pi, 2\pi, 3\pi, \ldots, L\pi$. This concentrates the bins around low frequencies: the highest frequency at $L = 6$ is $6\pi$, only $6\times$ the lowest. Misses high-frequency content entirely.

Single high frequency. Use just one large frequency, say $32\pi$. This captures high-frequency detail but misses everything else — the MLP would have to manufacture low frequencies from the high-frequency input, which is also a form of spectral bias in reverse.

Exponential spacing. Use frequencies $\pi, 2\pi, 4\pi, 8\pi, 16\pi, 32\pi$, which span five octaves (a $32\times$ range). Covers low, medium, and high frequencies in a single pass. This is what works empirically here, and it is the spacing used by the positional encoding in the original NeRF paper.

Natural signals (images, 3D scenes, audio) have approximately self-similar spectra at different scales. The exponential spacing matches this self-similarity, ensuring the model has access to information at the relevant scales of the data.
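The difference in coverage between the three schedules above is easy to quantify. The sketch below measures each one in octaves (doublings of frequency) between its lowest and highest entry; the variable and function names are illustrative.

```python
import math

L = 6
linear = [k * math.pi for k in range(1, L + 1)]       # pi, 2*pi, ..., 6*pi
single = [32 * math.pi]                                # one high frequency
exponential = [(2 ** k) * math.pi for k in range(L)]  # pi, 2*pi, ..., 32*pi

def octaves(freqs):
    """Number of frequency doublings between the lowest and highest entry."""
    return math.log2(max(freqs) / min(freqs))

for name, freqs in [("linear", linear), ("single", single), ("exponential", exponential)]:
    print(f"{name:12s} {len(freqs)} freqs, {octaves(freqs):.2f} octaves")
```

With the same budget of six frequencies, linear spacing covers about $2.6$ octaves while exponential spacing covers $5$.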

From 2D Image-Fitting to 3D NeRF

The 3D NeRF pipeline takes the same coordinate-MLP-plus-Fourier-features kernel and adds geometric machinery:

For each pixel of each training image, cast a viewing ray through the scene.
Sample 3D positions along the ray (typically 64 to 192 samples per ray).
For each sample, query the coordinate MLP to predict color and density.
Accumulate the per-sample predictions via the volume-rendering integral.
Compare the accumulated pixel color to the ground-truth pixel.
Backpropagate through the integral, the MLP, and the encoder.
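The accumulation step is the only part of this list with real arithmetic in it, so here is a minimal sketch of the usual discrete approximation to the volume-rendering integral: front-to-back alpha compositing with $\alpha_i = 1 - e^{-\sigma_i \delta_i}$. It is written in plain Python for a single ray; the function name `composite_ray` and its argument layout are assumptions for illustration, and a real implementation would be batched and differentiable.

```python
import math

def composite_ray(colors, sigmas, deltas):
    """Alpha-composite samples front to back along one ray.

    colors: list of (r, g, b); sigmas: densities; deltas: segment lengths.
    Returns the accumulated pixel colour and the leftover transmittance.
    """
    transmittance = 1.0  # fraction of light that has not yet been absorbed
    pixel = [0.0, 0.0, 0.0]
    for color, sigma, delta in zip(colors, sigmas, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha
        for c in range(3):
            pixel[c] += weight * color[c]
        transmittance *= 1.0 - alpha
    return pixel, transmittance

colors = [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)]
pixel, remaining = composite_ray(colors, sigmas=[1e9, 5.0], deltas=[0.1, 0.1])
print(pixel, remaining)  # [1.0, 0.0, 0.0] 0.0
```

In the usage example the first sample is effectively opaque, so the blue sample behind it contributes nothing to the pixel.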

The architecture and the encoding are unchanged from our 2D experiment. The geometry lives in the rendering pipeline, not in the model.

This split is conceptually important. Pre-NeRF "neural scene representations" tried to bake the geometry into the architecture (voxel-grid CNNs, scene-graph networks, etc.). NeRF kept the geometry simple (just ray casting) and put the representational power in the input encoding. The latter approach scales much better.

Generalisation Beyond NeRF

The Fourier-feature trick (and its descendants) appears across the neural-field literature:

SIREN (Sitzmann et al., 2020): sinusoidal activations everywhere, not just at the input. Achieves similar or better fidelity than Fourier features with end-to-end periodicity, but trickier to optimize.

Instant-NGP (Müller et al., 2022): replaces fixed Fourier features with a learnable multi-resolution hash encoding. Same core idea — replace raw coordinates with a richer representation — but with hash tables. Faster training, even higher fidelity.
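To give a flavour of what "hash tables instead of sinusoids" means, the sketch below shows only the spatial hashing step: mapping an integer grid vertex to a slot in a fixed-size feature table. The per-dimension constants are the ones I believe the Instant-NGP paper reports, but treat them, the function name `hash_index`, and the table size as assumptions. The full encoding also interpolates features between neighbouring vertices and concatenates several resolutions, which is omitted here.

```python
# ASSUMPTION: per-dimension multipliers as recalled from the Instant-NGP paper.
PRIMES = (1, 2654435761, 805459861)

def hash_index(ix, iy, iz, table_size):
    """Map a non-negative integer grid vertex to a slot in the feature table."""
    h = (ix * PRIMES[0]) ^ (iy * PRIMES[1]) ^ (iz * PRIMES[2])
    return h % table_size

table_size = 2 ** 19  # illustrative size, not taken from this post
print(hash_index(12, 7, 3, table_size) < table_size)  # True
```

Distinct vertices can collide in the same slot; the method relies on training to resolve those collisions in favour of the samples that matter most.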

Tri-Plane / Triplane encoding: represents a 3D scene as three 2D feature planes that the MLP samples from. Trades the simple Fourier-feature encoding for explicit spatial feature storage.

3D Gaussian Splatting (Kerbl et al., 2023): abandons the implicit neural field entirely. Represents the scene as a cloud of anisotropic 3D Gaussians, rendered by alpha-compositing. Trains in minutes, renders in milliseconds. But still descended from NeRF in spirit — the loss function is the same (rendered pixel vs ground truth).

The throughline: the successful follow-ups to NeRF changed the representation, not the size of the MLP. Where an MLP survives, it stays a small fully-connected network (SIREN changes its activations, but not its scale). The encoding became fancier (hash tables, triplanes, explicit Gaussians). This is consistent with the broader thesis: architecture is rarely the bottleneck; input representation is.

What This Demonstrates

Fourier feature encoding lifts coordinate MLPs out of the low-frequency-only regime imposed by spectral bias.
On a $64 \times 64$ image-fit task, the encoding produces a $+10.64$ dB PSNR improvement (= $11.6\times$ lower MSE) with $8\%$ more parameters and the same training time.
This is the kernel of NeRF and every successor neural-field method. The architecture is incidental; the input representation is everything.
The same idea carries over to SIREN, Instant-NGP, Gaussian Splatting, and the broader neural-rendering literature: each method rethinks how the signal is represented rather than scaling up the MLP.

The Bigger Lesson

Architecture is not always the answer. The pivotal innovation behind every neural-field method since 2020 was an input-representation trick, not a new layer type. A $4$-layer MLP is enough capacity to represent a 3D scene; what was missing for years was the right way to feed coordinates into it. This pattern recurs across ML: positional encodings in Transformers, byte-pair encoding for LLMs, feature engineering in classical ML. Sometimes the hard problem is not "what to do with the inputs" but "what are the inputs".

Full code on GitHub: github.com/soveshmohapatra/NeRF-Kernel
