
Deconstructing Neural ODEs from Scratch

Part 2: PyTorch Implementation

Introduction

In Part 1, we developed the mathematical foundations of Neural ODEs: initial value problems, Euler and RK4 solvers, the ResNet-to-ODE connection, and the adjoint method. Now we translate every equation into working PyTorch code.

Our implementation consists of three files: ode_solver.py (Euler and RK4 fixed-step solvers), neural_ode.py (ODEFunc, NeuralODELayer, ODEClassifier), and train.py (spiral dataset, training loop, visualization). No external ODE libraries. No torchdiffeq. Every operation is explicit.

ODE Solvers

Euler Step

The forward Euler update is a single line:

def euler_step(f, t, y, dt):
    return y + dt * f(t, y)

Because y has shape (batch, dim) and f returns the same shape, this naturally handles batched inputs. The scalar dt broadcasts across all dimensions.
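As a quick sanity check on the batching claim, the snippet below (restating euler_step so it runs standalone) integrates the linear ODE $d\mathbf{y}/dt = \mathbf{y}$ for a whole batch at once; the exact solution is $\mathbf{y}(1) = e\,\mathbf{y}(0)$:

```python
import math
import torch

torch.manual_seed(0)

def euler_step(f, t, y, dt):
    # The scalar dt broadcasts over the (batch, dim) state
    return y + dt * f(t, y)

f = lambda t, y: y            # dy/dt = y  =>  y(1) = e * y(0)
y0 = torch.randn(4, 3)        # batch of 4 samples in 3 dimensions
y, n_steps = y0.clone(), 1000
dt = 1.0 / n_steps
for i in range(n_steps):
    y = euler_step(f, i * dt, y, dt)

err = (y - math.e * y0).abs().max().item()
print(y.shape, err)           # shape preserved; small first-order error
```

One solver call advances every sample in the batch; no per-sample loop is needed.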

RK4 Step

def rk4_step(f, t, y, dt):
    k1 = f(t, y)
    k2 = f(t + dt/2, y + dt/2 * k1)
    k3 = f(t + dt/2, y + dt/2 * k2)
    k4 = f(t + dt, y + dt * k3)
    return y + (dt/6) * (k1 + 2*k2 + 2*k3 + k4)

Four evaluations of f, combined in a weighted average. The intermediate stages $\mathbf{k}_2$ and $\mathbf{k}_3$ sample the midpoint $t + \Delta t/2$, and the 1-2-2-1 weighting is chosen so that the lower-order error terms cancel, leaving a local truncation error of $O(\Delta t^5)$ and fourth-order global accuracy.
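To make the accuracy gap concrete, here is a small comparison (a sketch restating both steps) on $dy/dt = y$ with 20 steps each; Euler's error is on the order of $10^{-2}$, while RK4's is several orders of magnitude smaller:

```python
import math
import torch

def euler_step(f, t, y, dt):
    return y + dt * f(t, y)

def rk4_step(f, t, y, dt):
    k1 = f(t, y)
    k2 = f(t + dt/2, y + dt/2 * k1)
    k3 = f(t + dt/2, y + dt/2 * k2)
    k4 = f(t + dt, y + dt * k3)
    return y + (dt/6) * (k1 + 2*k2 + 2*k3 + k4)

f = lambda t, y: y                     # dy/dt = y, y(0) = 1  =>  y(1) = e
n_steps, dt = 20, 1.0 / 20

errs = {}
for name, step in [("euler", euler_step), ("rk4", rk4_step)]:
    y = torch.ones(1)
    for i in range(n_steps):
        y = step(f, i * dt, y, dt)
    errs[name] = abs(y.item() - math.e)

print(errs)  # rk4's error is orders of magnitude below euler's
```

Same step count, four times the function evaluations, vastly better accuracy: that is the trade RK4 offers.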

Integration Loop

def ode_solve(f, y0, t_span, n_steps, method="euler"):
    step_fn = {"euler": euler_step, "rk4": rk4_step}[method]
    t0, t1 = t_span
    dt = (t1 - t0) / n_steps
    trajectory, times = [y0], [t0]
    y, t = y0, t0
    for _ in range(n_steps):
        y = step_fn(f, t, y, dt)
        t = t + dt
        trajectory.append(y)
        times.append(t)
    return torch.stack(trajectory, dim=0), torch.tensor(times)

The function returns the full trajectory tensor of shape (n_steps+1, batch, dim) -- useful for visualization -- plus the time grid.
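A quick usage check confirms those shapes (the snippet restates a minimal Euler-only ode_solve so it runs standalone):

```python
import torch

def euler_step(f, t, y, dt):
    return y + dt * f(t, y)

def ode_solve(f, y0, t_span, n_steps, method="euler"):
    step_fn = {"euler": euler_step}[method]   # euler only in this sketch
    t0, t1 = t_span
    dt = (t1 - t0) / n_steps
    trajectory, times = [y0], [t0]
    y, t = y0, t0
    for _ in range(n_steps):
        y = step_fn(f, t, y, dt)
        t = t + dt
        trajectory.append(y)
        times.append(t)
    return torch.stack(trajectory, dim=0), torch.tensor(times)

f = lambda t, y: -y                       # simple decay dynamics
y0 = torch.randn(8, 2)                    # batch of 8, dim 2
traj, t_grid = ode_solve(f, y0, (0.0, 1.0), n_steps=10)

print(traj.shape)    # (11, 8, 2): n_steps+1 snapshots of the batch
print(t_grid.shape)  # (11,) time points from t=0 to t=1
```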

The Dynamics Network: ODEFunc

The heart of a Neural ODE is the function $f(t, \mathbf{y}; \boldsymbol{\theta})$ that defines the dynamics.

Architecture

class ODEFunc(nn.Module):
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden),
            nn.Tanh(),
            nn.Linear(hidden, hidden),
            nn.Tanh(),
            nn.Linear(hidden, dim),
        )

    def forward(self, t, y):
        # Concatenate scalar time onto every sample: the input is [t ; y]
        t_col = torch.full_like(y[:, :1], float(t))
        return self.net(torch.cat([t_col, y], dim=-1))

Three key design decisions:

Time concatenation. The input is $[\,t\;;\;\mathbf{y}\,] \in \mathbb{R}^{d+1}$, not just $\mathbf{y}$. This makes the dynamics non-autonomous: the vector field can change as we integrate. Without time as an input, the dynamics are autonomous -- a single fixed vector field governs the entire integration interval -- which limits expressivity.

Tanh activation. We use $\tanh$ rather than ReLU for a specific reason: the dynamics $f$ should be Lipschitz continuous to guarantee a well-posed ODE, and ideally smooth so that higher-order solvers keep their accuracy. Both activations are Lipschitz, but $\tanh$ is smooth with bounded derivatives of every order, whereas ReLU's derivative jumps at zero. A non-smooth vector field undermines RK4's error analysis, so this matters for solver stability.

Zero initialization of the last layer. We initialize the final linear layer to zero weights and biases:

nn.init.zeros_(self.net[-1].weight)
nn.init.zeros_(self.net[-1].bias)

This means the initial dynamics are $d\mathbf{y}/dt \approx 0$, so $\mathbf{y}(1) \approx \mathbf{y}(0)$. The Neural ODE starts as an approximate identity map and gradually learns to deform the space. This dramatically improves optimization stability.
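We can verify the identity-map behavior directly. The sketch below builds a smaller stand-in dynamics net (one hidden layer, for brevity) with the same zero-initialized last layer, integrates it with RK4, and checks that the output equals the input -- the zeroed final layer makes $f$ identically zero at initialization:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

dim, hidden = 6, 32
net = nn.Sequential(
    nn.Linear(dim + 1, hidden), nn.Tanh(),
    nn.Linear(hidden, dim),
)
nn.init.zeros_(net[-1].weight)   # zero the output layer, as in ODEFunc
nn.init.zeros_(net[-1].bias)

def f(t, y):
    # [t ; y] time concatenation, matching the article's design
    t_col = torch.full_like(y[:, :1], float(t))
    return net(torch.cat([t_col, y], dim=-1))

def rk4_step(f, t, y, dt):
    k1 = f(t, y)
    k2 = f(t + dt/2, y + dt/2 * k1)
    k3 = f(t + dt/2, y + dt/2 * k2)
    k4 = f(t + dt, y + dt * k3)
    return y + (dt/6) * (k1 + 2*k2 + 2*k3 + k4)

y = y0 = torch.randn(4, dim)
dt = 1.0 / 20
with torch.no_grad():
    for i in range(20):
        y = rk4_step(f, i * dt, y, dt)

print(torch.allclose(y, y0))  # True: zero dynamics leave the state untouched
```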

The Neural ODE Layer

class NeuralODELayer(nn.Module):
    def __init__(self, func, n_steps, method):
        super().__init__()
        self.func = func
        self.n_steps = n_steps
        self.method = method

    def forward(self, y0):
        trajectory, _ = ode_solve(
            self.func, y0, (0.0, 1.0), self.n_steps, self.method
        )
        return trajectory[-1]  # y(1)

The layer takes in $\mathbf{y}_0$ and returns $\mathbf{y}(1)$ by calling the ODE solver. Standard PyTorch autograd handles the backward pass by backpropagating through all solver steps.

For the experimental adjoint path, we implement a custom torch.autograd.Function that stores only the final state and reconstructs gradients in the backward pass, rather than retaining the full solver trajectory.
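To make the idea concrete, here is a minimal sketch of how such a Function can look (the names, helpers, and step counts are illustrative, not the article's actual code): the forward pass integrates under no_grad and saves only $\mathbf{y}(1)$; the backward pass re-integrates the augmented adjoint system from $t=1$ back to $t=0$, accumulating parameter gradients through vector-Jacobian products.

```python
import torch

def _add(state, k, c):
    # state + c * k, elementwise over a tuple of tensors
    return tuple(s + c * ki for s, ki in zip(state, k))

def _rk4_tuple(f, t, state, dt):
    # RK4 step on a tuple-valued state (works for negative dt too)
    k1 = f(t, state)
    k2 = f(t + dt/2, _add(state, k1, dt/2))
    k3 = f(t + dt/2, _add(state, k2, dt/2))
    k4 = f(t + dt, _add(state, k3, dt))
    return tuple(s + (dt/6) * (a + 2*b + 2*c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))

class AdjointODE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, y0, func, n_steps, *params):
        dt = 1.0 / n_steps
        wrapped = lambda t, s: (func(t, s[0]),)   # adapt func to the tuple solver
        with torch.no_grad():
            y, t = y0, 0.0
            for _ in range(n_steps):
                (y,) = _rk4_tuple(wrapped, t, (y,), dt)
                t += dt
        ctx.func, ctx.n_steps, ctx.params = func, n_steps, params
        ctx.save_for_backward(y)                  # only the final state is stored
        return y

    @staticmethod
    def backward(ctx, grad_y1):
        func, n_steps, params = ctx.func, ctx.n_steps, ctx.params
        (y1,) = ctx.saved_tensors
        dt = -1.0 / n_steps                       # integrate backward in time

        def aug(t, state):
            # Augmented dynamics: dy/dt = f, da/dt = -a df/dy, ds/dt = -a df/dp
            y, a = state[0], state[1]
            with torch.enable_grad():
                y = y.detach().requires_grad_(True)
                f_val = func(t, y)
                vjps = torch.autograd.grad(
                    f_val, (y,) + params, grad_outputs=a, allow_unused=True)
            dfdy, dfdp = vjps[0], vjps[1:]
            return (f_val.detach(), -dfdy,
                    *[-g if g is not None else torch.zeros_like(p)
                      for g, p in zip(dfdp, params)])

        state = (y1, grad_y1, *[torch.zeros_like(p) for p in params])
        t = 1.0
        for _ in range(n_steps):
            state = _rk4_tuple(aug, t, state, dt)
            t += dt
        # state[1] is dL/dy0; state[2:] are dL/dparams
        return (state[1], None, None, *state[2:])
```

Calling it as `AdjointODE.apply(y0, func, n_steps, *func.parameters())` passes the parameters explicitly so autograd knows to route gradients to them; memory stays O(1) in the number of solver steps, at the price of a second integration in the backward pass.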

The Classifier: Dimensional Lifting

Why Lifting is Necessary

A fundamental limitation of ODEs in $\mathbb{R}^d$: by the uniqueness theorem, solution trajectories cannot cross. In 2D, this means a continuous flow cannot separate two interleaving spirals -- the trajectories would have to pass through each other.

The solution: lift the data to a higher-dimensional space before the ODE.

class ODEClassifier(nn.Module):
    def __init__(self, input_dim, n_classes, ode_dim=6, ...):
        super().__init__()
        self.lift = nn.Linear(input_dim, ode_dim)  # 2D -> 6D
        self.odefunc = ODEFunc(ode_dim, hidden=128)
        self.ode_layer = NeuralODELayer(...)
        self.classifier = nn.Sequential(
            nn.Linear(ode_dim, 64),
            nn.ReLU(),
            nn.Linear(64, n_classes),
        )

The architecture is:

$$ \mathbf{x} \in \mathbb{R}^2 \;\xrightarrow{\text{lift}}\; \mathbf{y}_0 \in \mathbb{R}^6 \;\xrightarrow{\text{ODE}}\; \mathbf{y}(1) \in \mathbb{R}^6 \;\xrightarrow{\text{classify}}\; \hat{\mathbf{c}} \in \mathbb{R}^2 $$
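The pipeline is easy to trace end to end. The self-contained sketch below uses minimal stand-ins for the components above (a single-hidden-layer dynamics net and a plain Euler loop, not the article's full classes) to show the lift, flow, classify sequence:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

ode_dim, hidden, n_steps = 6, 32, 10

lift = nn.Linear(2, ode_dim)                      # x in R^2 -> y0 in R^6
odefunc = nn.Sequential(                          # stand-in for ODEFunc: input [t ; y]
    nn.Linear(ode_dim + 1, hidden), nn.Tanh(), nn.Linear(hidden, ode_dim))
classifier = nn.Sequential(nn.Linear(ode_dim, 64), nn.ReLU(), nn.Linear(64, 2))

def f(t, y):
    t_col = torch.full_like(y[:, :1], float(t))
    return odefunc(torch.cat([t_col, y], dim=-1))

def forward(x):
    y = lift(x)                                   # lift: R^2 -> R^6
    dt = 1.0 / n_steps
    for i in range(n_steps):                      # Euler integration of the flow
        y = y + dt * f(i * dt, y)
    return classifier(y)                          # classify: R^6 -> logits in R^2

x = torch.randn(5, 2)                             # a batch of 2D points
logits = forward(x)
print(logits.shape)                               # (5, 2)
```

Gradients flow through all three stages, so the lift, the dynamics, and the head train jointly.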

The Augmented Neural ODE Perspective

This lifting strategy is equivalent to the Augmented Neural ODE (Dupont et al., 2019). Instead of running the ODE in the original input space, we augment the state with extra dimensions initialized via a learned projection. The ODE operates in this higher-dimensional space where the uniqueness constraint no longer prevents separation.

In our case, lifting from 2D to 6D provides four extra dimensions for the flow to use in untangling the spirals.

The Baseline: ResNet MLP

For comparison, we implement a discrete residual MLP that mirrors the Neural ODE:

class ResNetMLP(nn.Module):
    def forward(self, x):
        h = self.lift(x)
        dt = 1.0 / self.n_blocks
        for block in self.blocks:
            h = h + dt * block(h)  # Euler-like residual
        return self.classifier(h)

Each of the 20 blocks has its own set of weights (independent nn.Linear layers), while the Neural ODE reuses the same ODEFunc weights at every step. The dt = 1/n_blocks scaling mirrors the Euler method step size.
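One consequence of weight sharing is parameter count. Assuming each ResNet block mirrors ODEFunc's shape minus the time input (an assumption; the article does not spell out the block architecture), the discrete model carries roughly 20x the parameters:

```python
import torch.nn as nn

def count(m):
    return sum(p.numel() for p in m.parameters())

dim, hidden, n_blocks = 6, 128, 20

# Shared dynamics net: one copy of the weights, reused at every solver step
odefunc = nn.Sequential(
    nn.Linear(dim + 1, hidden), nn.Tanh(),
    nn.Linear(hidden, hidden), nn.Tanh(),
    nn.Linear(hidden, dim))

# Hypothetical ResNet blocks: same shape (minus the time input), independent weights
blocks = nn.ModuleList(
    nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(),
                  nn.Linear(hidden, hidden), nn.Tanh(),
                  nn.Linear(hidden, dim))
    for _ in range(n_blocks))

print(count(odefunc), count(blocks))  # roughly a 20x gap
```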

Summary

We now have a complete implementation: fixed-step Euler and RK4 solvers, a time-aware dynamics network with a zero-initialized output layer, a Neural ODE layer, a dimensionally lifted classifier, and a matched ResNet baseline. In Part 3, we train everything on the spiral dataset and analyze the results: accuracy, speed, trajectories, and memory.