Deconstructing Neural ODEs from Scratch

Part 1: The Math of Continuous Depth

Introduction

What if a neural network had no fixed depth? What if, instead of stacking a finite number of layers, we described the transformation of data as a continuous process governed by a differential equation?

This is the central idea of Neural Ordinary Differential Equations (Neural ODEs), introduced by Chen et al. (NeurIPS 2018). They replace discrete residual blocks with a continuous dynamical system:

$$ \frac{d\mathbf{y}}{dt} = f(\mathbf{y}(t),\, t;\, \boldsymbol{\theta}), \qquad \mathbf{y}(0) = \mathbf{x}, $$

where $f$ is a neural network parameterized by $\boldsymbol{\theta}$, and the "output" of the network is the state at $t=1$: $\hat{\mathbf{y}} = \mathbf{y}(1)$.

In this first part of a three-part series, we develop the mathematical foundations: initial value problems, numerical solvers, and the adjoint method for memory-efficient training.

ODEs and Initial Value Problems

An ordinary differential equation (ODE) describes how a quantity $\mathbf{y}(t) \in \mathbb{R}^d$ evolves over time:

$$ \frac{d\mathbf{y}}{dt} = f(t,\, \mathbf{y}(t)). $$

Given an initial condition $\mathbf{y}(t_0) = \mathbf{y}_0$, this is called an initial value problem (IVP). Under mild regularity conditions on $f$ (continuity in $t$ and Lipschitz continuity in $\mathbf{y}$), the Picard-Lindelöf theorem guarantees that a unique solution exists.

In the neural network context, the pieces of the IVP map onto familiar objects:

  - The state $\mathbf{y}(t)$ plays the role of the hidden activation.
  - Time $t$ plays the role of depth.
  - The vector field $f$ is the (shared) network, with $\mathbf{y}(0) = \mathbf{x}$ as the input and $\mathbf{y}(1)$ as the output.

A critical property of ODE solutions is that trajectories cannot cross. Because the solution through any point is unique, two solution curves starting at different initial conditions in $\mathbb{R}^d$ can never occupy the same state at the same time $t$. This has deep implications for what Neural ODEs can and cannot represent -- the flow cannot, for instance, make two regions of input space pass through one another -- and is the reason why augmenting the state dimension is essential for complex tasks.
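The non-crossing property is easy to observe numerically. The sketch below uses an illustrative 1-D, time-dependent vector field $f(t, y) = -y + \sin t$ (our own choice, not from the original): two trajectories started at different heights approach each other but never swap order.

```python
import numpy as np

def euler_traj(f, y0, t0, t1, n_steps):
    """Record the full forward-Euler trajectory of dy/dt = f(t, y)."""
    dt = (t1 - t0) / n_steps
    t, y = t0, float(y0)
    ys = [y]
    for _ in range(n_steps):
        y = y + dt * f(t, y)
        t += dt
        ys.append(y)
    return np.array(ys)

# Illustrative time-dependent 1-D vector field (not from the original text).
f = lambda t, y: -y + np.sin(t)

hi = euler_traj(f, 0.5, 0.0, 5.0, 500)   # starts above
lo = euler_traj(f, -0.5, 0.0, 5.0, 500)  # starts below
gap = hi - lo  # satisfies d(gap)/dt = -gap: it shrinks but never hits zero
```

The gap between the two solutions decays exponentially but stays strictly positive, so the curves never intersect.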

Euler and Runge-Kutta Methods

Since we rarely have closed-form solutions, we solve the IVP numerically using fixed-step methods.

Forward Euler Method

The simplest approach: approximate the derivative with a forward difference.

$$ \mathbf{y}_{n+1} = \mathbf{y}_n + \Delta t \cdot f(t_n,\, \mathbf{y}_n). $$

This is a first-order method: the local truncation error is $O(\Delta t^2)$ and the global error is $O(\Delta t)$.
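As a sketch, forward Euler fits in a few lines of NumPy (the function name and test problem here are our own, not from a library):

```python
import numpy as np

def euler_solve(f, y0, t0, t1, n_steps):
    """Integrate dy/dt = f(t, y) from t0 to t1 with n_steps forward-Euler steps."""
    dt = (t1 - t0) / n_steps
    t, y = t0, np.asarray(y0, dtype=float)
    for _ in range(n_steps):
        y = y + dt * f(t, y)  # y_{n+1} = y_n + dt * f(t_n, y_n)
        t += dt
    return y

# Sanity check on dy/dt = y, y(0) = 1, whose exact solution is y(1) = e.
approx = euler_solve(lambda t, y: y, 1.0, 0.0, 1.0, 1000)
```

Even with 1000 steps the answer is only accurate to about three digits, consistent with the $O(\Delta t)$ global error.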

Key observation: Compare this to a residual network block:

$$ \mathbf{h}_{l+1} = \mathbf{h}_l + f_l(\mathbf{h}_l). $$

A ResNet is literally a forward Euler discretization of an ODE, with $\Delta t = 1$ and each block $f_l$ having its own parameters.

Classical Runge-Kutta (RK4)

A fourth-order method using four intermediate evaluations:

$$\begin{align} \mathbf{k}_1 &= f(t_n,\; \mathbf{y}_n), \\ \mathbf{k}_2 &= f\!\left(t_n + \tfrac{\Delta t}{2},\; \mathbf{y}_n + \tfrac{\Delta t}{2}\,\mathbf{k}_1\right), \\ \mathbf{k}_3 &= f\!\left(t_n + \tfrac{\Delta t}{2},\; \mathbf{y}_n + \tfrac{\Delta t}{2}\,\mathbf{k}_2\right), \\ \mathbf{k}_4 &= f(t_n + \Delta t,\; \mathbf{y}_n + \Delta t\,\mathbf{k}_3), \\[6pt] \mathbf{y}_{n+1} &= \mathbf{y}_n + \frac{\Delta t}{6}\left(\mathbf{k}_1 + 2\mathbf{k}_2 + 2\mathbf{k}_3 + \mathbf{k}_4\right). \end{align}$$

The global error is $O(\Delta t^4)$ -- dramatically better accuracy for the same step count, at the cost of four function evaluations per step instead of one.
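A direct transcription of the update rule above (a sketch; function names are ours):

```python
import numpy as np

def rk4_step(f, t, y, dt):
    """One classical Runge-Kutta (RK4) step for dy/dt = f(t, y)."""
    k1 = f(t, y)
    k2 = f(t + dt / 2, y + dt / 2 * k1)
    k3 = f(t + dt / 2, y + dt / 2 * k2)
    k4 = f(t + dt, y + dt * k3)
    return y + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

def rk4_solve(f, y0, t0, t1, n_steps):
    dt = (t1 - t0) / n_steps
    t, y = t0, np.asarray(y0, dtype=float)
    for _ in range(n_steps):
        y = rk4_step(f, t, y, dt)
        t += dt
    return y

# Same sanity check: dy/dt = y, y(0) = 1, exact y(1) = e.
approx = rk4_solve(lambda t, y: y, 1.0, 0.0, 1.0, 10)
```

Just 10 steps already give roughly six significant digits on this problem.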

Trade-off

For Neural ODEs, each "function evaluation" means a forward pass through the dynamics network $f$. So:

  - Forward Euler costs one forward pass per step, with $O(\Delta t)$ global error.
  - RK4 costs four forward passes per step, with $O(\Delta t^4)$ global error.

Whether the accuracy improvement justifies the $4\times$ compute cost depends on the problem.
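One way to make the comparison fair is to match the number of function evaluations (NFEs). In the sketch below (both solvers rewritten compactly; the toy problem is our own choice), Euler gets 40 steps and RK4 gets 10 steps, i.e. 40 NFEs each:

```python
import numpy as np

def euler(f, y, t0, t1, n):
    dt = (t1 - t0) / n
    for i in range(n):
        y = y + dt * f(t0 + i * dt, y)
    return y

def rk4(f, y, t0, t1, n):
    dt = (t1 - t0) / n
    for i in range(n):
        t = t0 + i * dt
        k1 = f(t, y)
        k2 = f(t + dt / 2, y + dt / 2 * k1)
        k3 = f(t + dt / 2, y + dt / 2 * k2)
        k4 = f(t + dt, y + dt * k3)
        y = y + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    return y

g = lambda t, y: y  # dy/dt = y, exact y(1) = e

err_euler = abs(euler(g, 1.0, 0.0, 1.0, 40) - np.e)  # 40 NFEs
err_rk4 = abs(rk4(g, 1.0, 0.0, 1.0, 10) - np.e)      # 40 NFEs (4 per step)
```

At an equal NFE budget, RK4 wins by several orders of magnitude on this smooth toy problem; for rougher dynamics or looser accuracy targets the balance can shift.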

From ResNets to Continuous Depth

The connection between residual networks and ODEs can be made precise.

A standard ResNet computes:

$$ \mathbf{h}_{l+1} = \mathbf{h}_l + f_l(\mathbf{h}_l;\, \boldsymbol{\theta}_l), \qquad l = 0, 1, \ldots, L-1. $$

Each block $f_l$ has its own parameters $\boldsymbol{\theta}_l$. Now consider what happens as $L \to \infty$ and each block becomes infinitesimally small:

$$ \mathbf{h}_{l+1} - \mathbf{h}_l = \frac{1}{L}\, f\!\left(\frac{l}{L},\, \mathbf{h}_l;\, \boldsymbol{\theta}\right) \quad \xrightarrow{L \to \infty} \quad \frac{d\mathbf{h}}{dt} = f(t,\, \mathbf{h};\, \boldsymbol{\theta}). $$

This is the Neural ODE. The discrete layer index $l$ becomes continuous time $t$. Separate per-layer parameters $\boldsymbol{\theta}_l$ collapse into a single shared $\boldsymbol{\theta}$.
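The correspondence can be checked directly. In the sketch below (a hypothetical toy dynamics function, not the article's network), a weight-tied ResNet with $L$ blocks of step $1/L$ computes exactly the same thing as forward Euler on $[0, 1]$:

```python
import numpy as np

rng = np.random.default_rng(0)
W = 0.1 * rng.normal(size=(4, 4))  # toy shared parameters theta (illustrative)

def f(t, h):
    """Shared dynamics f(t, h; theta) -- a single tanh layer as a stand-in."""
    return np.tanh(h @ W)

def resnet(h, L):
    """Weight-tied ResNet: h <- h + (1/L) * f(l/L, h) for l = 0..L-1."""
    for l in range(L):
        h = h + (1.0 / L) * f(l / L, h)
    return h

def euler(h, L):
    """Forward Euler on [0, 1] with dt = 1/L -- the identical update rule."""
    dt = 1.0 / L
    for l in range(L):
        h = h + dt * f(l * dt, h)
    return h

x = rng.normal(size=4)
```

Both functions produce the same output for any $L$; increasing $L$ drives them toward the continuous-time solution.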

Consequences:

  1. Parameter efficiency: One network $f$ replaces $L$ separate blocks. In our experiments, this gives a $19.3\times$ parameter reduction.
  2. Adaptive depth: We can use fewer or more solver steps at test time without retraining.
  3. Invertibility: ODE flows are inherently invertible (just integrate backward in time), enabling normalizing flows.
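Invertibility is also easy to verify numerically: integrate forward from $t = 0$ to $t = 1$, then feed the result back into the same solver with the time interval reversed. A sketch with a toy dynamics function (illustrative, not the article's network):

```python
import numpy as np

W = 0.1 * np.random.default_rng(1).normal(size=(3, 3))  # toy dynamics (illustrative)
f = lambda t, y: np.tanh(y @ W)

def rk4_solve(f, y, t0, t1, n):
    """RK4 integration; works for t1 < t0 too, since dt is then negative."""
    dt = (t1 - t0) / n
    t = t0
    for _ in range(n):
        k1 = f(t, y)
        k2 = f(t + dt / 2, y + dt / 2 * k1)
        k3 = f(t + dt / 2, y + dt / 2 * k2)
        k4 = f(t + dt, y + dt * k3)
        y = y + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        t += dt
    return y

y0 = np.array([1.0, -0.5, 0.3])
y1 = rk4_solve(f, y0, 0.0, 1.0, 100)      # forward: x -> y(1)
y0_rec = rk4_solve(f, y1, 1.0, 0.0, 100)  # backward: y(1) -> y(0)
```

Up to solver error, the round trip recovers the initial state exactly.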

The Adjoint Method

Training a Neural ODE requires computing gradients $\frac{\partial \mathcal{L}}{\partial \boldsymbol{\theta}}$. The naive approach -- backpropagating through all solver steps -- stores the entire forward trajectory, using $O(L)$ memory for $L$ solver steps.

The adjoint method avoids this by solving a second ODE backward in time.

Define the adjoint state:

$$ \mathbf{a}(t) = \frac{\partial \mathcal{L}}{\partial \mathbf{y}(t)}. $$

It satisfies the adjoint ODE:

$$ \frac{d\mathbf{a}}{dt} = -\mathbf{a}(t)^\top \frac{\partial f}{\partial \mathbf{y}}, $$

which is integrated backward from $t = T$ to $t = 0$, with terminal condition $\mathbf{a}(T) = \frac{\partial \mathcal{L}}{\partial \mathbf{y}(T)}$.

The parameter gradients accumulate as:

$$ \frac{\partial \mathcal{L}}{\partial \boldsymbol{\theta}} = -\int_T^0 \mathbf{a}(t)^\top \frac{\partial f}{\partial \boldsymbol{\theta}} \, dt. $$

Memory cost: $O(1)$ in depth -- we only need the current state, not the full trajectory. This is the key advantage that makes Neural ODEs practical for very deep (many-step) integration.

Compute cost: We must re-solve the forward ODE (or store checkpoints) to evaluate $\frac{\partial f}{\partial \mathbf{y}}$ at each backward step. This trades memory for computation.
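The whole scheme can be checked end to end on a scalar problem where the gradient is known in closed form. Below is a sketch (our own toy setup, not from the original): dynamics $f(y) = \theta y$ and loss $\mathcal{L} = y(T)$, so $\partial\mathcal{L}/\partial\theta = y_0 T e^{\theta T}$. The backward pass integrates the augmented state $[\,y, \mathbf{a}, g\,]$ from $T$ to $0$, re-solving $y$ on the fly instead of storing the trajectory:

```python
import numpy as np

theta, y0, T = 0.7, 1.3, 1.0

def f(t, y):
    return theta * y  # dynamics: dy/dt = theta * y

def aug_dynamics(t, s):
    """Augmented state s = [y, a, g]: y is re-solved backward, a is the
    adjoint (da/dt = -a * df/dy), and g accumulates -a * df/dtheta."""
    y, a, g = s
    df_dy = theta
    df_dtheta = y
    return np.array([f(t, y), -a * df_dy, -a * df_dtheta])

def rk4_solve(f, y, t0, t1, n):
    dt = (t1 - t0) / n
    t = t0
    for _ in range(n):
        k1 = f(t, y)
        k2 = f(t + dt / 2, y + dt / 2 * k1)
        k3 = f(t + dt / 2, y + dt / 2 * k2)
        k4 = f(t + dt, y + dt * k3)
        y = y + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        t += dt
    return y

yT = rk4_solve(f, y0, 0.0, T, 100)          # forward pass: y(T)
aT = 1.0                                    # dL/dy(T) for L = y(T)
s0 = rk4_solve(aug_dynamics, np.array([yT, aT, 0.0]), T, 0.0, 100)
grad_adjoint = s0[2]                        # g(0) = dL/dtheta

grad_analytic = y0 * T * np.exp(theta * T)  # d/dtheta [y0 * e^(theta*T)]
```

Only the three-component augmented state is carried through the backward solve, which is the $O(1)$-memory claim in miniature.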

Summary

We have covered the mathematical core of Neural ODEs: initial value problems and their well-posedness, fixed-step solvers (forward Euler and RK4), the view of ResNets as Euler discretizations, and the adjoint method for memory-efficient gradients. In Part 2, we implement all of this from scratch in pure PyTorch: ODE solvers, the dynamics network, the Neural ODE layer, and a classifier for spiral data.