Deconstructing NeRF Kernel: Part 2 - Forty Lines

Overview

Part 1 argued that the pivotal innovation in NeRF — and in every neural-field method since — was not the architecture but the input representation. Replacing raw coordinates with Fourier features turns a coordinate MLP from a low-frequency-only function into a representation rich enough to capture high-frequency visual detail.

This part implements the kernel: a FourierFeatures encoder with no trainable parameters and a CoordMLP that consumes the encoded coordinates and produces RGB. Total code: $40$ lines for the kernel, plus a training script that fits a $64 \times 64$ image. The benchmarks in Part 3 will show what those $40$ lines buy you.

Why This Particular Implementation

The full NeRF pipeline has three components: a coordinate MLP with Fourier features, a viewing-ray sampling procedure, and a volume-rendering integral that turns per-point colour and density into a final pixel. The first component is the part that made NeRF possible. The other two are essentially geometry — the ray sampling is camera mathematics, the volume rendering is alpha compositing. Both predate NeRF.

For this series we focus on the first component, which is also the part that transfers to other neural-field methods (SIREN, Instant-NGP, etc.) that do not use ray-based rendering. The benchmark in Part 3 is 2D image regression rather than 3D scene reconstruction, but the underlying numerical phenomenon — spectral bias defeated by Fourier features — is identical.

If you want full 3D NeRF, you add about $500$ lines of ray-casting and volume-rendering code on top of what's described here. The hard part of NeRF is not the rendering; it is the kernel.

The Fourier Features Class

import torch
import torch.nn as nn


class FourierFeatures(nn.Module):
    def __init__(self, input_dim, num_freqs):
        super().__init__()
        self.input_dim = input_dim
        self.num_freqs = num_freqs
        freqs = (2.0 ** torch.arange(num_freqs)) * torch.pi
        self.register_buffer("freqs", freqs)

    @property
    def out_dim(self):
        return 2 * self.num_freqs * self.input_dim

    def forward(self, x):
        x_scaled = x.unsqueeze(-1) * self.freqs       # (..., D, L)
        sin = torch.sin(x_scaled)
        cos = torch.cos(x_scaled)
        feats = torch.stack([sin, cos], dim=-1)       # (..., D, L, 2)
        return feats.flatten(start_dim=-3)            # (..., 2 L D)

Six lines of forward. No trainable parameters anywhere. Several design choices are encoded in this short snippet.

Frequencies are precomputed and stored as a buffer. The line freqs = (2.0 ** torch.arange(num_freqs)) * torch.pi computes the $L$ frequencies $\pi, 2\pi, 4\pi, 8\pi, \ldots, 2^{L-1} \pi$ once at construction time. register_buffer makes them part of the module state — they move with .to(device) and appear in checkpoints — but PyTorch knows they are not trainable, so they don't appear in parameters() or receive gradients.
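The buffer behaviour described above can be checked directly. The following is a minimal sketch that repeats the encoder so it runs on its own; the variable names `n_trainable`, `buffer_names` and `state_keys` are illustrative and not part of the original code.

```python
import torch
import torch.nn as nn


class FourierFeatures(nn.Module):
    """Same encoder as in the text, repeated so this snippet is self-contained."""

    def __init__(self, input_dim, num_freqs):
        super().__init__()
        self.input_dim = input_dim
        self.num_freqs = num_freqs
        freqs = (2.0 ** torch.arange(num_freqs)) * torch.pi
        self.register_buffer("freqs", freqs)

    @property
    def out_dim(self):
        return 2 * self.num_freqs * self.input_dim

    def forward(self, x):
        x_scaled = x.unsqueeze(-1) * self.freqs
        feats = torch.stack([torch.sin(x_scaled), torch.cos(x_scaled)], dim=-1)
        return feats.flatten(start_dim=-3)


enc = FourierFeatures(input_dim=2, num_freqs=6)

# parameters() only yields trainable tensors; buffers are excluded.
n_trainable = sum(p.numel() for p in enc.parameters())
# named_buffers() lists what register_buffer stored.
buffer_names = [name for name, _ in enc.named_buffers()]
# state_dict() is what a checkpoint would contain.
state_keys = list(enc.state_dict().keys())

print(n_trainable, buffer_names, state_keys)
print(enc(torch.zeros(5, 2)).shape)
```

An optimizer built from `enc.parameters()` would have nothing to update, which is the intended behaviour for this encoder.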

Why exponentially spaced frequencies? Natural signals (images, 3D scenes) have approximately self-similar spectra at different scales: a face has features at scale 10 cm (nose, eyes), 1 cm (eyebrows, lips), and 1 mm (pores, hairs), all simultaneously. Linearly spaced frequencies would cluster around one scale and miss the others. Exponential spacing covers a wide range of scales with a small number of bins. With coordinates normalised to $[-1, 1]$, $L = 6$ frequencies run from $1$ to $32$ cycles across the input range. That is five octaves of scale (a factor of $32$), and the top frequency reaches the Nyquist limit of the $64$-pixel image fitted here.
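The frequency range can be worked out by hand or with a few lines of arithmetic. This sketch assumes coordinates normalised to $[-1, 1]$ (as in the training section) and counts how many full cycles each feature completes across that interval; the names `omegas`, `cycles` and `octaves` are illustrative.

```python
import math

L = 6  # num_freqs used in this series

# Angular frequencies pi, 2*pi, 4*pi, ..., 2^(L-1) * pi.
omegas = [(2 ** k) * math.pi for k in range(L)]

# Cycles completed as a coordinate sweeps [-1, 1], an interval of length 2:
# cycles = omega * length / (2 * pi).
cycles = [omega * 2.0 / (2.0 * math.pi) for omega in omegas]

# Ratio of highest to lowest frequency, expressed in octaves (doublings).
octaves = math.log2(omegas[-1] / omegas[0])

print(cycles)
print(octaves)
```

Each extra frequency adds one octave, so doubling the finest resolvable detail costs only $2D$ additional input features.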

Why both sin and cos at each frequency? Together, $\sin(\omega p)$ and $\cos(\omega p)$ form a basis for any phase at frequency $\omega$. Any waveform $A \sin(\omega p + \phi)$ can be written as a linear combination of $\sin(\omega p)$ and $\cos(\omega p)$, so the MLP can recover any phase by linearly combining the two features.

If we only included $\sin$, the MLP would be locked to a fixed phase. The encoder has no learnable parameters, so it cannot adjust the phase based on the data. Including $\cos$ as well gives the MLP the flexibility to represent any phase.
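The phase argument rests on the identity $A \sin(\omega p + \phi) = A\cos\phi \, \sin(\omega p) + A\sin\phi \, \cos(\omega p)$. A small numerical check, with arbitrarily chosen values for `A`, `phi` and `omega` and a hypothetical helper name, might look like this:

```python
import math


def phase_as_sin_cos(amplitude, phase):
    """Return (a, b) with A*sin(w*p + phi) == a*sin(w*p) + b*cos(w*p)."""
    return amplitude * math.cos(phase), amplitude * math.sin(phase)


omega = 4 * math.pi        # one of the encoder's frequencies
A, phi = 0.7, 1.3          # arbitrary amplitude and phase
a, b = phase_as_sin_cos(A, phi)

# Compare both sides of the identity on a grid of points in [-1, 1].
points = [-1.0 + 2.0 * i / 200 for i in range(201)]
max_err = max(
    abs(A * math.sin(omega * p + phi)
        - (a * math.sin(omega * p) + b * math.cos(omega * p)))
    for p in points
)
print(max_err)
```

The pair `(a, b)` is exactly the kind of linear weighting the first `nn.Linear` layer can learn, which is why the encoder itself does not need a phase parameter.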

The Tensor Reshape Magic

The shape handling in the forward pass deserves attention: how the encoder broadcasts an input of shape $(B, D)$ into an output of shape $(B, 2LD)$.

x_scaled = x.unsqueeze(-1) * self.freqs       # (B, D) * (L,) → (B, D, L)
sin = torch.sin(x_scaled)
cos = torch.cos(x_scaled)
feats = torch.stack([sin, cos], dim=-1)       # (B, D, L, 2)
return feats.flatten(start_dim=-3)            # (B, 2*L*D)

Step 1. x.unsqueeze(-1) turns shape $(B, D)$ into $(B, D, 1)$. Multiplying by self.freqs of shape $(L,)$ broadcasts to $(B, D, L)$: each coordinate scaled by each frequency.

Step 2. Apply $\sin$ and $\cos$ element-wise — both produce $(B, D, L)$ tensors.

Step 3. torch.stack([sin, cos], dim=-1) stacks them along a new last axis, producing $(B, D, L, 2)$.

Step 4. feats.flatten(start_dim=-3) collapses the last three axes $(D, L, 2)$ into one of size $2LD$. The MLP sees a flat vector of $2LD$ features per input.

This is one of those PyTorch idioms that look impenetrable until you trace the shapes. Once you do, it's just clean broadcasting. No reshape gymnastics, no transpose tricks: each step adds or merges axes in a predictable way.
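Tracing the shapes on a tiny tensor makes the four steps concrete and also shows where each feature lands in the flattened vector. The sizes `B, D, L = 4, 2, 3` are arbitrary choices for illustration, and the index formula in the comments follows from row-major flattening of the $(D, L, 2)$ axes.

```python
import torch

B, D, L = 4, 2, 3
x = torch.rand(B, D) * 2 - 1                      # coordinates in [-1, 1]
freqs = (2.0 ** torch.arange(L)) * torch.pi       # (L,)

x_scaled = x.unsqueeze(-1) * freqs                # (B, D, 1) * (L,) -> (B, D, L)
feats = torch.stack([torch.sin(x_scaled),
                     torch.cos(x_scaled)], dim=-1)  # (B, D, L, 2)
flat = feats.flatten(start_dim=-3)                # (B, 2*L*D)

print(x_scaled.shape, feats.shape, flat.shape)

# Row-major flattening puts the feature for coordinate d, frequency l at
# index (d * L + l) * 2 for sin and (d * L + l) * 2 + 1 for cos.
d, l = 1, 2
idx = (d * L + l) * 2
same_sin = torch.allclose(flat[:, idx], torch.sin(x[:, d] * freqs[l]))
same_cos = torch.allclose(flat[:, idx + 1], torch.cos(x[:, d] * freqs[l]))
print(same_sin, same_cos)
```

The ordering does not matter to the MLP, since the first linear layer can weight the features in any order, but knowing it helps when inspecting learned first-layer weights.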

The CoordMLP

class CoordMLP(nn.Module):
    def __init__(self, input_dim=2, output_dim=3, hidden=128, n_layers=4,
                 use_pe=True, num_freqs=6):
        super().__init__()
        if use_pe:
            self.encoder = FourierFeatures(input_dim, num_freqs)
            in_features = self.encoder.out_dim
        else:
            self.encoder = nn.Identity()
            in_features = input_dim
        layers = [nn.Linear(in_features, hidden), nn.ReLU()]
        for _ in range(n_layers - 2):
            layers += [nn.Linear(hidden, hidden), nn.ReLU()]
        layers.append(nn.Linear(hidden, output_dim))
        self.mlp = nn.Sequential(*layers)

    def forward(self, x):
        return torch.sigmoid(self.mlp(self.encoder(x)))

The use_pe flag toggles the encoder. With it on, the first nn.Linear receives $2 L D$ inputs ($24$ for our $2$D coordinates with $L = 6$). With it off, it receives the raw $D$ inputs ($2$ coordinates).

Everything else — the layer count, hidden dimension, activation function, output projection — is identical between the two variants. This is essential for the fair comparison in Part 3: we want to isolate the effect of the encoder, not confound it with architecture changes.
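One caveat is worth quantifying: the first linear layer is wider when the encoder is on, so the two variants do not have exactly the same parameter count. A rough sketch of the arithmetic, using hypothetical helper names and the default sizes from `CoordMLP`:

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one nn.Linear(n_in, n_out)."""
    return n_in * n_out + n_out


def coord_mlp_params(in_features, hidden=128, n_layers=4, output_dim=3):
    """Parameter count for the layer stack built in CoordMLP.__init__."""
    total = linear_params(in_features, hidden)               # first layer
    total += (n_layers - 2) * linear_params(hidden, hidden)  # hidden layers
    total += linear_params(hidden, output_dim)               # output layer
    return total


with_pe = coord_mlp_params(in_features=2 * 6 * 2)   # 2*L*D encoded inputs
without_pe = coord_mlp_params(in_features=2)        # raw (x, y) inputs

print(with_pe, without_pe, with_pe - without_pe)
```

The gap is confined to the first layer and is small relative to the total, so it is unlikely to account for a large quality difference on its own.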

The sigmoid output. RGB values lie in $[0, 1]$. The sigmoid output activation bounds the model's output to this range, which is both correct (no impossible pixel values) and helpful for optimization (the sigmoid's gradient shrinks smoothly near $0$ and $1$, whereas a hard clamp has exactly zero gradient once the output leaves the range).

ReLU activations. Standard for MLPs. The spectral-bias problem we are solving with Fourier features is usually analysed for ReLU MLPs, but it is not unique to them: tanh MLPs also have spectral bias, just with different rates. Sinusoidal-activation MLPs (SIRENs) solve the same problem differently, by making the entire network operate in a more periodic regime. The Fourier-feature solution is independent of the activation function; it operates at the input layer.

Why the Encoder Has Zero Trainable Parameters

This is worth emphasising. Many "feature engineering" steps in classical ML are also trainable: feature scaling parameters, learned tokenization, learned embedding tables. The Fourier features encoder has none of this.

The frequencies are fixed at construction time. The sin/cos functions are deterministic. The flatten is a reshape, not an operation with parameters. The only thing the model learns is the MLP that consumes the features.

This is a deliberate design choice. The job of the encoder is to provide a fixed, well-conditioned basis for the MLP. Adding trainable parameters to the encoder would let the model adjust the basis during training, which sounds like it should help — but in practice it makes the optimization landscape much harder because the basis and the MLP are simultaneously moving targets.

Some later methods (Instant-NGP) do add learnable encoding parameters, but they do so very carefully, with multi-resolution hash grids that decouple different scales. Naively making the Fourier frequencies learnable tends not to improve on fixed sin/cos.

Training Loop

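import torch.nn.functional as F

# Assumes device, coords (H*W, 2) and target (H*W, 3) are already defined.
model = CoordMLP(use_pe=True, num_freqs=6).to(device)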
opt = torch.optim.AdamW(model.parameters(), lr=5e-4)
for _ in range(2000):
    pred = model(coords)               # coords: (H*W, 2)
    loss = F.mse_loss(pred, target)    # target: (H*W, 3)
    opt.zero_grad()
    loss.backward()
    opt.step()

Image regression. Build a $(H \cdot W, 2)$ grid of normalised pixel coordinates (range $[-1, 1]$) and the corresponding $(H \cdot W, 3)$ RGB target, then minimize MSE for $2000$ iterations of full-batch AdamW at learning rate $\eta = 5 \times 10^{-4}$.
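The training loop takes `coords` and `target` as given. One possible way to build them is sketched below; the helper name `make_coord_grid`, the `(x, y)` column order, and the random placeholder image are all assumptions made for illustration, not details from the original training script.

```python
import torch


def make_coord_grid(height, width):
    """Return an (H*W, 2) tensor of (x, y) coordinates in [-1, 1]."""
    ys = torch.linspace(-1.0, 1.0, height)
    xs = torch.linspace(-1.0, 1.0, width)
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")   # each (H, W)
    # Row-major flattening: x varies fastest, matching image.reshape(-1, 3).
    return torch.stack([grid_x, grid_y], dim=-1).reshape(-1, 2)


H, W = 64, 64
coords = make_coord_grid(H, W)          # (4096, 2)

# Placeholder standing in for a real image loaded as floats in [0, 1].
image = torch.rand(H, W, 3)
target = image.reshape(-1, 3)           # (4096, 3), same pixel order as coords

print(coords.shape, target.shape)
```

Whatever layout is chosen, the key constraint is that row `i` of `coords` and row `i` of `target` refer to the same pixel.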

"Full batch" is unusual for ML training but appropriate here. The dataset is tiny ($64 \times 64 = 4096$ pixels), so the whole thing fits comfortably in memory. With no mini-batching, every iteration sees the entire image, and the loss landscape is fully deterministic — no stochastic gradient noise.

The lack of validation set. For coordinate-based image fitting, there is no train/test split in the usual sense. We are fitting one image, not learning a function that generalises across images. The "test" is whether the fitted model can reproduce the target image, which is essentially the training MSE. Both variants (PE and no-PE) are evaluated on the same fitting task.

Computing PSNR

Mean squared error is a useful loss but a hard-to-interpret quality metric. Image-quality literature reports peak signal-to-noise ratio (PSNR):

$$\text{PSNR} = 10 \log_{10}(\text{MAX}^2 / \text{MSE})$$

For images normalised to $[0, 1]$, $\text{MAX} = 1$, so PSNR $= -10 \log_{10}(\text{MSE})$. A $10$ dB increase in PSNR corresponds to a $10\times$ reduction in MSE. A few rough reference points: $20$ dB is "noticeable artifacts", $30$ dB is "good quality JPEG", $40$ dB is "visually indistinguishable from the original".

The point of converting to PSNR is interpretability. The Part 3 result will be that Fourier features add $+10.64$ dB PSNR over the no-encoding baseline. That tells you immediately: an order-of-magnitude reduction in MSE, a real qualitative change, not just a marginal numeric improvement.
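The conversion in both directions is a one-liner. The sketch below uses hypothetical helper names and assumes images normalised to $[0, 1]$; it also turns the $10.64$ dB gap quoted above into an MSE ratio, which comes out at roughly $11.6\times$.

```python
import math


def psnr_from_mse(mse, max_value=1.0):
    """PSNR in dB for a given mean squared error."""
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_from_db_gain(db_gain):
    """Factor by which MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (db_gain / 10.0)


print(psnr_from_mse(1e-2))            # MSE of 0.01
print(psnr_from_mse(1e-3))            # MSE of 0.001
print(mse_ratio_from_db_gain(10.64))  # the Part 3 gap as an MSE ratio
```

Inside the training loop, the same helper could be applied to `loss.item()` to log PSNR alongside MSE.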

What Production NeRF Adds

To go from our 2D-image-fitting demo to a working 3D NeRF, you would add:

Ray sampling. Given camera parameters (origin, viewing direction, focal length), generate rays for each pixel. Sample 3D positions along each ray (stratified sampling or hierarchical sampling for efficiency).

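Volume rendering. For each ray's sampled positions, query the coordinate MLP to predict $(c_i, \sigma_i)$, colour and density. Composite them via $C(\mathbf{r}) = \sum_i T_i (1 - \exp(-\sigma_i \delta_i)) c_i$ where $T_i = \exp(-\sum_{j<i} \sigma_j \delta_j)$ is the transmittance accumulated before sample $i$ and $\delta_i$ is the distance between adjacent samples.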
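The compositing sum can be sketched for a single ray in a few lines. This is an illustrative sketch rather than the code from the repository: the function name `composite_ray` is hypothetical, and it assumes the standard NeRF transmittance $T_i = \exp(-\sum_{j<i} \sigma_j \delta_j)$, which equals the running product of $(1 - \alpha_j)$ over earlier samples.

```python
import torch


def composite_ray(sigmas, deltas, colours):
    """Alpha-composite N samples along one ray.

    sigmas:  (N,)   densities predicted by the MLP
    deltas:  (N,)   distances between adjacent samples
    colours: (N, 3) RGB predicted by the MLP
    Returns the pixel colour (3,) and per-sample weights (N,).
    """
    alphas = 1.0 - torch.exp(-sigmas * deltas)             # opacity per sample
    # Exclusive cumulative product: T_0 = 1, T_i = prod_{j<i} (1 - alpha_j).
    survival = torch.cat([torch.ones_like(alphas[:1]), 1.0 - alphas[:-1]])
    transmittance = torch.cumprod(survival, dim=0)
    weights = transmittance * alphas                        # T_i * alpha_i
    pixel = (weights.unsqueeze(-1) * colours).sum(dim=0)
    return pixel, weights


# Empty space, then a nearly opaque green surface, then something behind it.
sigmas = torch.tensor([0.0, 1000.0, 5.0])
deltas = torch.tensor([0.1, 0.1, 0.1])
colours = torch.tensor([[1.0, 0.0, 0.0],
                        [0.0, 1.0, 0.0],
                        [0.0, 0.0, 1.0]])
pixel, weights = composite_ray(sigmas, deltas, colours)
print(pixel, weights)
```

In this toy ray the first sample contributes nothing and the opaque second sample hides the third, so the pixel comes out green. The same weights are what hierarchical sampling reuses as a distribution over depth.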

View-direction encoding. NeRF encodes both position $\mathbf{x}$ and view direction $\mathbf{d}$ with separate Fourier features. The view direction enters late in the MLP (after the density is computed) so that density depends only on position but colour can vary with viewing angle.

Two-stage sampling. A coarse network is evaluated at stratified samples along the ray; its density predictions are used to importance-sample points for a fine network. This is the "hierarchical sampling" trick, which concentrates samples where the scene has content and improves quality for a given sample budget.

None of this changes the kernel. The MLP is still CoordMLP. The encoder is still FourierFeatures. The difference is in what gets queried and how outputs are aggregated.

Generalisation: Beyond NeRF

The idea behind the Fourier feature trick, changing how the scene is represented rather than enlarging the network, recurs across the neural-field literature:

SIREN (Sitzmann et al., 2020) uses sinusoidal activations everywhere in the network, not just at the input. The whole MLP operates in a periodic regime, which can give higher fidelity than fixed Fourier features at the cost of trickier optimization (it depends on a careful weight initialisation).

Instant-NGP (Müller et al., 2022) replaces fixed sin/cos with a learnable multi-resolution hash encoding. Same idea — replace raw coordinates with a richer representation — implemented with hash tables for efficiency. Achieves higher fidelity than fixed Fourier features and trains in seconds rather than hours.

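Gaussian Splatting (Kerbl et al., 2023) abandons the implicit neural field entirely. It represents the scene as a cloud of anisotropic 3D Gaussians and renders by alpha-compositing them directly. No coordinate MLP at all. Trains in minutes, renders in milliseconds. But it is still descended from NeRF: it is optimised by gradient descent on a photometric reconstruction loss against posed images, although the loss itself differs (L1 plus a D-SSIM term rather than NeRF's squared error).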

Each successor changed the representation rather than scaling up the MLP. The pattern: network size is not the bottleneck; representation is. A small MLP has enough capacity to represent a 3D scene; what was missing for years was the right way to feed coordinates into it.

What Part 3 Tests

Part 3 runs the head-to-head: same MLP, same data, same training budget — with and without Fourier features. The PSNR difference is $10.64$ dB. The visual difference between the two reconstructions is the difference between "blurry blob" and "recognisable image". This single design choice — encode coordinates through sin/cos at exponentially-spaced frequencies — is what makes coordinate-based neural fields a viable representation for visual content at all.

Full code on GitHub: github.com/soveshmohapatra/NeRF-Kernel
