Deconstructing Neural ODEs: Part 1 - The Math of Continuous Depth

Introduction

Standard neural networks have a fixed depth: you stack $L$ layers and that's it. Neural Ordinary Differential Equations (Chen et al., NeurIPS 2018) replace that discrete stack with a continuous dynamical system:

\frac{d\mathbf{y}}{dt} = f(t,\, \mathbf{y}(t);\, \boldsymbol{\theta}), \qquad \mathbf{y}(0) = \mathbf{x},

where $f$ is a neural network parameterized by $\boldsymbol{\theta}$, and the "output" is the state at $t=1$: $\hat{\mathbf{y}} = \mathbf{y}(1)$.

This post covers the mathematical foundations: initial value problems, Euler and RK4 solvers, and the adjoint method for memory-efficient training.

ODEs and Initial Value Problems

An ordinary differential equation (ODE) describes how a quantity $\mathbf{y}(t) \in \mathbb{R}^d$ evolves over time:

\frac{d\mathbf{y}}{dt} = f(t,\, \mathbf{y}(t)).

Given an initial condition $\mathbf{y}(t_0) = \mathbf{y}_0$, this is called an initial value problem (IVP). If $f$ is continuous in $t$ and Lipschitz continuous in $\mathbf{y}$, the Picard-Lindelöf theorem guarantees that a unique solution exists, at least on some interval around $t_0$.

In the neural network context:

$\mathbf{y}(t)$ is the hidden state at "time" (depth) $t$.
$f$ is the learned dynamics -- a neural network.
$\mathbf{y}_0 = \mathbf{x}$ is the input data.
$\mathbf{y}(T)$ is the output, with $T$ typically set to 1.

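One property worth flagging now: trajectories cannot cross. When solutions are unique, two solution curves that start from different initial conditions can never occupy the same state at the same time $t$. This constrains what Neural ODEs can represent (the map from input to output must be a continuous, invertible deformation of the input space) and is why augmenting the state dimension matters for harder tasks.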
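The non-crossing property is easiest to see in one dimension, where it means the ordering of the inputs is preserved. For example, no 1D flow can map $x$ to $-x$ for every input. The sketch below checks the ordering numerically. It assumes a hand-written `toy_dynamics` as a stand-in for a learned network and a crude small-step integrator named `integrate`; both names are illustrative and not from the original paper.

```python
import math

def integrate(f, y0, t0, t1, n_steps):
    """Crude fixed-step integrator: repeatedly move along the slope f(t, y)."""
    dt = (t1 - t0) / n_steps
    t, y = t0, y0
    for _ in range(n_steps):
        y = y + dt * f(t, y)
        t = t + dt
    return y

def toy_dynamics(t, y):
    # Hypothetical stand-in for a learned network f(t, y); smooth in y.
    return math.sin(4.0 * t) * y - y ** 3

inputs = [-1.0, -0.3, 0.2, 0.9]  # sorted initial conditions y(0)
outputs = [integrate(toy_dynamics, x, 0.0, 1.0, 1000) for x in inputs]

# In 1D, "trajectories cannot cross" means sorted inputs give sorted outputs.
order_preserved = all(a < b for a, b in zip(outputs, outputs[1:]))
print(outputs, order_preserved)
```

Because $y = 0$ is itself a solution of this toy system, negative inputs stay negative and positive inputs stay positive, which is the same constraint seen from another angle.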

Euler and Runge-Kutta Methods

Since we rarely have closed-form solutions, we solve the IVP numerically. This post sticks to fixed-step methods.

Forward Euler Method

The simplest approach: approximate the derivative with a forward difference.

\mathbf{y}_{n+1} = \mathbf{y}_n + \Delta t \cdot f(t_n,\, \mathbf{y}_n).

This is a first-order method: the local truncation error is $O(\Delta t^2)$ and the global error is $O(\Delta t)$.
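A quick way to check the first-order claim is to halve the step size and watch the global error roughly halve. The sketch below does this on the test problem $dy/dt = -y$, $y(0) = 1$, whose exact solution at $t = 1$ is $e^{-1}$. The function names `euler_solve` and `decay` are illustrative choices.

```python
import math

def euler_solve(f, y0, t0, t1, n_steps):
    """Integrate dy/dt = f(t, y) from t0 to t1 with n_steps forward Euler steps."""
    dt = (t1 - t0) / n_steps
    t, y = t0, y0
    for _ in range(n_steps):
        y = y + dt * f(t, y)  # y_{n+1} = y_n + dt * f(t_n, y_n)
        t = t + dt
    return y

def decay(t, y):
    # Test problem with known solution y(t) = exp(-t).
    return -y

exact = math.exp(-1.0)
err_20 = abs(euler_solve(decay, 1.0, 0.0, 1.0, 20) - exact)
err_40 = abs(euler_solve(decay, 1.0, 0.0, 1.0, 40) - exact)

# For a first-order method, halving dt should roughly halve the global error.
ratio = err_20 / err_40
print(err_20, err_40, ratio)
```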

Key observation: Compare this to a residual network block:

\mathbf{h}_{l+1} = \mathbf{h}_l + f_l(\mathbf{h}_l).

A ResNet is literally a forward Euler discretization of an ODE, with $\Delta t = 1$ and each block $f_l$ having its own parameters.
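To make the correspondence concrete, here is a minimal sketch with a scalar hidden state and three residual blocks. The weights in `block_params` are made-up numbers, and each block is assumed to be $f_l(h) = \tanh(w_l h + b_l)$. Running the blocks in sequence gives the same result as taking forward Euler steps with $\Delta t = 1$ on dynamics that switch parameters at integer times.

```python
import math

# Hypothetical per-block parameters (w_l, b_l); block l computes tanh(w_l * h + b_l).
block_params = [(0.5, 0.1), (-0.3, 0.2), (0.8, -0.4)]

def resnet_forward(h, params):
    """Plain residual stack: h_{l+1} = h_l + f_l(h_l)."""
    for w, b in params:
        h = h + math.tanh(w * h + b)
    return h

def euler_forward(h, params, dt=1.0):
    """Forward Euler on dh/dt = f(t, h), where f uses block floor(t)'s parameters."""
    def f(t, h):
        w, b = params[int(t)]
        return math.tanh(w * h + b)

    t = 0.0
    for _ in range(len(params)):
        h = h + dt * f(t, h)
        t = t + dt
    return h

resnet_out = resnet_forward(0.7, block_params)
euler_out = euler_forward(0.7, block_params)
print(resnet_out, euler_out)
```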

Classical Runge-Kutta (RK4)

A fourth-order method using four intermediate evaluations:

\begin{align} \mathbf{k}_1 &= f(t_n,\; \mathbf{y}_n), \\ \mathbf{k}_2 &= f\!\left(t_n + \tfrac{\Delta t}{2},\; \mathbf{y}_n + \tfrac{\Delta t}{2}\,\mathbf{k}_1\right), \\ \mathbf{k}_3 &= f\!\left(t_n + \tfrac{\Delta t}{2},\; \mathbf{y}_n + \tfrac{\Delta t}{2}\,\mathbf{k}_2\right), \\ \mathbf{k}_4 &= f(t_n + \Delta t,\; \mathbf{y}_n + \Delta t\,\mathbf{k}_3), \\[6pt] \mathbf{y}_{n+1} &= \mathbf{y}_n + \frac{\Delta t}{6}\left(\mathbf{k}_1 + 2\mathbf{k}_2 + 2\mathbf{k}_3 + \mathbf{k}_4\right). \end{align}

The global error is $O(\Delta t^4)$ -- much better accuracy for the same step count, at the cost of four function evaluations per step instead of one.
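The update rule above translates almost line for line into code. The sketch below is one possible scalar implementation, checked on $dy/dt = -y$ with exact solution $e^{-1}$ at $t = 1$. For a fourth-order method, halving the step size should cut the global error by roughly $2^4 = 16$. The name `rk4_solve` is an illustrative choice.

```python
import math

def rk4_solve(f, y0, t0, t1, n_steps):
    """Integrate dy/dt = f(t, y) from t0 to t1 with n_steps classical RK4 steps."""
    dt = (t1 - t0) / n_steps
    t, y = t0, y0
    for _ in range(n_steps):
        k1 = f(t, y)
        k2 = f(t + dt / 2, y + dt / 2 * k1)
        k3 = f(t + dt / 2, y + dt / 2 * k2)
        k4 = f(t + dt, y + dt * k3)
        y = y + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        t = t + dt
    return y

def decay(t, y):
    # Test problem with known solution y(t) = exp(-t).
    return -y

exact = math.exp(-1.0)
err_5 = abs(rk4_solve(decay, 1.0, 0.0, 1.0, 5) - exact)
err_10 = abs(rk4_solve(decay, 1.0, 0.0, 1.0, 10) - exact)

# Halving dt should shrink the error by about a factor of 16.
ratio = err_5 / err_10
print(err_5, err_10, ratio)
```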

Trade-off

For Neural ODEs, each "function evaluation" means a forward pass through the dynamics network $f$. So:

Euler: 1 NFE/step $\times$ 20 steps = 20 NFE per forward pass
RK4: 4 NFE/step $\times$ 20 steps = 80 NFE per forward pass

Whether the accuracy improvement justifies the $4\times$ compute cost depends on the problem.
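One way to reason about the trade-off is to count function evaluations directly and compare solvers at an equal NFE budget rather than an equal step count. The sketch below wraps the dynamics in a hypothetical `CountingDynamics` helper and uses the smooth test problem $dy/dt = -y$. A real dynamics network may behave differently, so treat the error comparison as illustrative only.

```python
import math

class CountingDynamics:
    """Wraps f(t, y) and counts how many times it is evaluated (the NFE)."""
    def __init__(self, f):
        self.f = f
        self.nfe = 0

    def __call__(self, t, y):
        self.nfe += 1
        return self.f(t, y)

def euler_solve(f, y0, t0, t1, n_steps):
    dt = (t1 - t0) / n_steps
    t, y = t0, y0
    for _ in range(n_steps):
        y = y + dt * f(t, y)
        t = t + dt
    return y

def rk4_solve(f, y0, t0, t1, n_steps):
    dt = (t1 - t0) / n_steps
    t, y = t0, y0
    for _ in range(n_steps):
        k1 = f(t, y)
        k2 = f(t + dt / 2, y + dt / 2 * k1)
        k3 = f(t + dt / 2, y + dt / 2 * k2)
        k4 = f(t + dt, y + dt * k3)
        y = y + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        t = t + dt
    return y

def run(solver, n_steps):
    """Solve dy/dt = -y on [0, 1] and return (NFE, absolute error at t = 1)."""
    f = CountingDynamics(lambda t, y: -y)
    y1 = solver(f, 1.0, 0.0, 1.0, n_steps)
    return f.nfe, abs(y1 - math.exp(-1.0))

euler_nfe, euler_err = run(euler_solve, 20)       # 1 NFE/step x 20 steps
rk4_nfe, rk4_err = run(rk4_solve, 20)             # 4 NFE/step x 20 steps
rk4_small_nfe, rk4_small_err = run(rk4_solve, 5)  # same NFE budget as Euler
print(euler_nfe, rk4_nfe, rk4_small_nfe)
```

On this problem, RK4 with 5 steps uses the same number of evaluations as Euler with 20 steps and still ends up with a smaller error, which is why step count alone is not the right unit of comparison.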

From ResNets to Continuous Depth

The connection between residual networks and ODEs is exact, not just an analogy.

A standard ResNet computes:

\mathbf{h}_{l+1} = \mathbf{h}_l + f_l(\mathbf{h}_l;\, \boldsymbol{\theta}_l), \qquad l = 0, 1, \ldots, L-1.

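Each block $f_l$ has its own parameters $\boldsymbol{\theta}_l$. Now share one set of parameters $\boldsymbol{\theta}$ across all blocks, feed the block's position $t_l = l/L$ to $f$ as an extra input, scale each update by the step size $\Delta t = 1/L$, and let $L \to \infty$: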

\mathbf{h}_{l+1} - \mathbf{h}_l = \frac{1}{L}\, f\!\left(\frac{l}{L},\, \mathbf{h}_l;\, \boldsymbol{\theta}\right) \quad \xrightarrow{L \to \infty} \quad \frac{d\mathbf{h}}{dt} = f(t,\, \mathbf{h};\, \boldsymbol{\theta}).

That's the Neural ODE. The discrete layer index $l$ becomes continuous time $t$, and separate per-layer parameters $\boldsymbol{\theta}_l$ collapse into a single shared $\boldsymbol{\theta}$.

Consequences:

Parameter efficiency: One network $f$ replaces $L$ separate blocks. In our experiments (Part 3), this gives a $19.3\times$ parameter reduction.
Adaptive depth: You can use fewer or more solver steps at test time without retraining.
Invertibility: ODE flows are invertible (integrate backward in time), which enables normalizing flows.
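The last two consequences can be checked numerically on a toy system. In the sketch below, `toy_dynamics` is a hand-written stand-in for a trained network (an assumption, not a learned model), and `rk4_solve` is a plain scalar RK4 integrator. It solves the same ODE with 10 and with 40 steps, then integrates backward from $t = 1$ to $t = 0$ to recover the input.

```python
import math

def rk4_solve(f, y0, t0, t1, n_steps):
    """Classical RK4 for a scalar state; t1 < t0 integrates backward in time."""
    dt = (t1 - t0) / n_steps
    t, y = t0, y0
    for _ in range(n_steps):
        k1 = f(t, y)
        k2 = f(t + dt / 2, y + dt / 2 * k1)
        k3 = f(t + dt / 2, y + dt / 2 * k2)
        k4 = f(t + dt, y + dt * k3)
        y = y + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        t = t + dt
    return y

def toy_dynamics(t, y):
    # Hypothetical stand-in for a trained dynamics network.
    return math.tanh(1.5 * y) - y

x = 0.2

# Adaptive depth: the same dynamics solved with different step counts.
out_10 = rk4_solve(toy_dynamics, x, 0.0, 1.0, 10)
out_40 = rk4_solve(toy_dynamics, x, 0.0, 1.0, 40)
depth_gap = abs(out_10 - out_40)

# Invertibility: integrate from t = 1 back to t = 0 to recover the input.
recovered = rk4_solve(toy_dynamics, out_40, 1.0, 0.0, 40)
inversion_error = abs(recovered - x)
print(depth_gap, inversion_error)
```

The inversion is only as accurate as the solver, so in practice the recovered input matches up to numerical error rather than exactly.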

The Adjoint Method

Training a Neural ODE requires computing gradients $\frac{\partial \mathcal{L}}{\partial \boldsymbol{\theta}}$. The naive approach (backpropagate through all solver steps) stores the entire forward trajectory, so memory grows linearly with the number of solver steps: $O(N)$ for $N$ steps.

The adjoint method avoids this by solving a second ODE backward in time.

Define the adjoint state:

\mathbf{a}(t) = \frac{\partial \mathcal{L}}{\partial \mathbf{y}(t)}.

It satisfies the adjoint ODE:

\frac{d\mathbf{a}}{dt} = -\mathbf{a}(t)^\top \frac{\partial f}{\partial \mathbf{y}},

which is integrated backward from $t = T$ to $t = 0$, with terminal condition $\mathbf{a}(T) = \frac{\partial \mathcal{L}}{\partial \mathbf{y}(T)}$.

The parameter gradients accumulate as:

\frac{\partial \mathcal{L}}{\partial \boldsymbol{\theta}} = -\int_T^0 \mathbf{a}(t)^\top \frac{\partial f}{\partial \boldsymbol{\theta}} \, dt.

Memory cost: $O(1)$ in depth -- we only need the current state, not the full trajectory. This is what makes Neural ODEs practical when the solver takes many steps.

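Compute cost: Evaluating $\frac{\partial f}{\partial \mathbf{y}}$ and $\frac{\partial f}{\partial \boldsymbol{\theta}}$ at each backward step requires $\mathbf{y}(t)$, which was not stored. We recover it by integrating the state ODE backward in time alongside the adjoint (or by restoring it from stored checkpoints). This trades memory for computation.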
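The adjoint equations can be sanity-checked on a problem small enough to solve by hand. The sketch below assumes scalar linear dynamics $f(t, y; \theta) = \theta y$ and the loss $\mathcal{L} = \tfrac{1}{2} y(T)^2$, for which $\partial f / \partial y = \theta$, $\partial f / \partial \theta = y$, and the exact gradient is $\partial \mathcal{L} / \partial \theta = T x^2 e^{2 \theta T}$. These choices, and the names `rk4_solve` and `augmented_dynamics`, are illustrative and not taken from the original paper's code.

```python
import math

def rk4_solve(f, state, t0, t1, n_steps):
    """Classical RK4 for a list-valued state; t1 < t0 integrates backward."""
    dt = (t1 - t0) / n_steps
    t = t0
    for _ in range(n_steps):
        k1 = f(t, state)
        k2 = f(t + dt / 2, [s + dt / 2 * k for s, k in zip(state, k1)])
        k3 = f(t + dt / 2, [s + dt / 2 * k for s, k in zip(state, k2)])
        k4 = f(t + dt, [s + dt * k for s, k in zip(state, k3)])
        state = [s + dt / 6 * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4)]
        t = t + dt
    return state

theta, x, T, n_steps = 0.5, 1.5, 1.0, 50

# Forward pass: dy/dt = theta * y. Only the final state y(T) is kept.
y_T = rk4_solve(lambda t, s: [theta * s[0]], [x], 0.0, T, n_steps)[0]

def augmented_dynamics(t, s):
    """State, adjoint, and gradient accumulator, integrated together."""
    y, a, _ = s
    return [theta * y,   # dy/dt = f(t, y)
            -a * theta,  # da/dt = -a * df/dy,     with df/dy = theta
            -a * y]      # dg/dt = -a * df/dtheta, with df/dtheta = y

# Terminal conditions at t = T: a(T) = dL/dy(T) = y(T), accumulator g(T) = 0.
y_0, a_0, grad_theta = rk4_solve(
    augmented_dynamics, [y_T, y_T, 0.0], T, 0.0, n_steps)

analytic_grad = T * x ** 2 * math.exp(2 * theta * T)
print(grad_theta, analytic_grad)
```

The backward solve starts from $\mathbf{y}(T)$ alone: the state is reconstructed on the way back (`y_0` should land close to the input `x`), and `a_0` is the gradient of the loss with respect to the input.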

Summary

Neural ODEs replace discrete residual layers with a continuous ODE: $d\mathbf{y}/dt = f(t, \mathbf{y}; \boldsymbol{\theta})$.
Euler and RK4 solvers trade off accuracy vs. compute (1 vs. 4 function evaluations per step).
The ResNet-to-ODE connection is exact: a ResNet is a forward Euler discretization of an ODE.
The adjoint method enables $O(1)$-memory training by solving a backward ODE instead of backpropagating through all solver steps.

In Part 2, we implement all of this from scratch in pure PyTorch: ODE solvers, the dynamics network, the Neural ODE layer, and a classifier for spiral data.
