Deconstructing Neural ODEs from Scratch

Part 1: The Math of Continuous Depth

Introduction

What if a neural network had no fixed depth? What if, instead of stacking a finite number of layers, we described the transformation of data as a continuous process governed by a differential equation?

This is the central idea of Neural Ordinary Differential Equations (Neural ODEs), introduced by Chen et al. (NeurIPS 2018). They replace discrete residual blocks with a continuous dynamical system:

$$ \frac{d\mathbf{y}}{dt} = f(\mathbf{y}(t),\, t;\, \boldsymbol{\theta}), \qquad \mathbf{y}(0) = \mathbf{x}, $$

where $f$ is a neural network parameterized by $\boldsymbol{\theta}$, and the "output" of the network is the state at $t=1$: $\hat{\mathbf{y}} = \mathbf{y}(1)$.

In this first part of a three-part series, we develop the mathematical foundations: initial value problems, numerical solvers, and the adjoint method for memory-efficient training.

ODEs and Initial Value Problems

An ordinary differential equation (ODE) describes how a quantity $\mathbf{y}(t) \in \mathbb{R}^d$ evolves over time:

$$ \frac{d\mathbf{y}}{dt} = f(t,\, \mathbf{y}(t)). $$

Given an initial condition $\mathbf{y}(t_0) = \mathbf{y}_0$, this is called an initial value problem (IVP). Under mild regularity conditions on $f$ (continuity in $t$ and Lipschitz continuity in $\mathbf{y}$), the Picard-Lindelöf theorem guarantees that a unique solution exists.

In the neural network context, the pieces of the IVP map onto familiar objects:

  - The state $\mathbf{y}(t)$ plays the role of the hidden activation.
  - Time $t$ plays the role of depth.
  - The vector field $f$ is the (shared) network, with $\mathbf{y}(0) = \mathbf{x}$ as the input and $\mathbf{y}(1)$ as the output.

A critical property of ODE solutions is that trajectories cannot cross. Because the solution through any point is unique, two solution curves starting at different initial conditions in $\mathbb{R}^d$ can never occupy the same state at the same time $t$. This has deep implications for what Neural ODEs can and cannot represent -- the flow cannot, for instance, make two regions of input space pass through one another -- and is the reason why augmenting the state dimension is essential for complex tasks.
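The non-crossing property is easy to observe numerically. The sketch below uses an illustrative 1-D, time-dependent vector field $f(t, y) = -y + \sin t$ (our own choice, not from the original): two trajectories started at different heights approach each other but never swap order.

```python
import numpy as np

def euler_traj(f, y0, t0, t1, n_steps):
    """Record the full forward-Euler trajectory of dy/dt = f(t, y)."""
    dt = (t1 - t0) / n_steps
    t, y = t0, float(y0)
    ys = [y]
    for _ in range(n_steps):
        y = y + dt * f(t, y)
        t += dt
        ys.append(y)
    return np.array(ys)

# Illustrative time-dependent 1-D vector field (not from the original text).
f = lambda t, y: -y + np.sin(t)

hi = euler_traj(f, 0.5, 0.0, 5.0, 500)   # starts above
lo = euler_traj(f, -0.5, 0.0, 5.0, 500)  # starts below
gap = hi - lo  # satisfies d(gap)/dt = -gap: it shrinks but never hits zero
```

The gap between the two solutions decays exponentially but stays strictly positive, so the curves never intersect.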

Euler and Runge-Kutta Methods

Since we rarely have closed-form solutions, we solve the IVP numerically using fixed-step methods.

Forward Euler Method

The simplest approach: approximate the derivative with a forward difference.

$$ \mathbf{y}_{n+1} = \mathbf{y}_n + \Delta t \cdot f(t_n,\, \mathbf{y}_n). $$

This is a first-order method: the local truncation error is $O(\Delta t^2)$ and the global error is $O(\Delta t)$.
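As a sketch, forward Euler fits in a few lines of NumPy (the function name and test problem here are our own, not from a library):

```python
import numpy as np

def euler_solve(f, y0, t0, t1, n_steps):
    """Integrate dy/dt = f(t, y) from t0 to t1 with n_steps forward-Euler steps."""
    dt = (t1 - t0) / n_steps
    t, y = t0, np.asarray(y0, dtype=float)
    for _ in range(n_steps):
        y = y + dt * f(t, y)  # y_{n+1} = y_n + dt * f(t_n, y_n)
        t += dt
    return y

# Sanity check on dy/dt = y, y(0) = 1, whose exact solution is y(1) = e.
approx = euler_solve(lambda t, y: y, 1.0, 0.0, 1.0, 1000)
```

Even with 1000 steps the answer is only accurate to about three digits, consistent with the $O(\Delta t)$ global error.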

Key observation: Compare this to a residual network block:

$$ \mathbf{h}_{l+1} = \mathbf{h}_l + f_l(\mathbf{h}_l). $$

A ResNet is literally a forward Euler discretization of an ODE, with $\Delta t = 1$ and each block $f_l$ having its own parameters.

Classical Runge-Kutta (RK4)

A fourth-order method using four intermediate evaluations:

$$\begin{align} \mathbf{k}_1 &= f(t_n,\; \mathbf{y}_n), \\ \mathbf{k}_2 &= f\!\left(t_n + \tfrac{\Delta t}{2},\; \mathbf{y}_n + \tfrac{\Delta t}{2}\,\mathbf{k}_1\right), \\ \mathbf{k}_3 &= f\!\left(t_n + \tfrac{\Delta t}{2},\; \mathbf{y}_n + \tfrac{\Delta t}{2}\,\mathbf{k}_2\right), \\ \mathbf{k}_4 &= f(t_n + \Delta t,\; \mathbf{y}_n + \Delta t\,\mathbf{k}_3), \\[6pt] \mathbf{y}_{n+1} &= \mathbf{y}_n + \frac{\Delta t}{6}\left(\mathbf{k}_1 + 2\mathbf{k}_2 + 2\mathbf{k}_3 + \mathbf{k}_4\right). \end{align}$$

The global error is $O(\Delta t^4)$ -- dramatically better accuracy for the same step count, at the cost of four function evaluations per step instead of one.
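A direct transcription of the update rule above (a sketch; function names are ours):

```python
import numpy as np

def rk4_step(f, t, y, dt):
    """One classical Runge-Kutta (RK4) step for dy/dt = f(t, y)."""
    k1 = f(t, y)
    k2 = f(t + dt / 2, y + dt / 2 * k1)
    k3 = f(t + dt / 2, y + dt / 2 * k2)
    k4 = f(t + dt, y + dt * k3)
    return y + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

def rk4_solve(f, y0, t0, t1, n_steps):
    dt = (t1 - t0) / n_steps
    t, y = t0, np.asarray(y0, dtype=float)
    for _ in range(n_steps):
        y = rk4_step(f, t, y, dt)
        t += dt
    return y

# Same sanity check: dy/dt = y, y(0) = 1, exact y(1) = e.
approx = rk4_solve(lambda t, y: y, 1.0, 0.0, 1.0, 10)
```

Just 10 steps already give roughly six significant digits on this problem.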

Trade-off

For Neural ODEs, each "function evaluation" means a forward pass through the dynamics network $f$. So:

  - Forward Euler costs one forward pass per step, with $O(\Delta t)$ global error.
  - RK4 costs four forward passes per step, with $O(\Delta t^4)$ global error.

Whether the accuracy improvement justifies the $4\times$ compute cost depends on the problem.
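One way to make the comparison fair is to match the number of function evaluations (NFEs). In the sketch below (both solvers rewritten compactly; the toy problem is our own choice), Euler gets 40 steps and RK4 gets 10 steps, i.e. 40 NFEs each:

```python
import numpy as np

def euler(f, y, t0, t1, n):
    dt = (t1 - t0) / n
    for i in range(n):
        y = y + dt * f(t0 + i * dt, y)
    return y

def rk4(f, y, t0, t1, n):
    dt = (t1 - t0) / n
    for i in range(n):
        t = t0 + i * dt
        k1 = f(t, y)
        k2 = f(t + dt / 2, y + dt / 2 * k1)
        k3 = f(t + dt / 2, y + dt / 2 * k2)
        k4 = f(t + dt, y + dt * k3)
        y = y + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    return y

g = lambda t, y: y  # dy/dt = y, exact y(1) = e

err_euler = abs(euler(g, 1.0, 0.0, 1.0, 40) - np.e)  # 40 NFEs
err_rk4 = abs(rk4(g, 1.0, 0.0, 1.0, 10) - np.e)      # 40 NFEs (4 per step)
```

At an equal NFE budget, RK4 wins by several orders of magnitude on this smooth toy problem; for rougher dynamics or looser accuracy targets the balance can shift.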

From ResNets to Continuous Depth

The connection between residual networks and ODEs can be made precise.

A standard ResNet computes:

$$ \mathbf{h}_{l+1} = \mathbf{h}_l + f_l(\mathbf{h}_l;\, \boldsymbol{\theta}_l), \qquad l = 0, 1, \ldots, L-1. $$

Each block $f_l$ has its own parameters $\boldsymbol{\theta}_l$. Now consider what happens as $L \to \infty$ and each block becomes infinitesimally small:

$$ \mathbf{h}_{l+1} - \mathbf{h}_l = \frac{1}{L}\, f\!\left(\frac{l}{L},\, \mathbf{h}_l;\, \boldsymbol{\theta}\right) \quad \xrightarrow{L \to \infty} \quad \frac{d\mathbf{h}}{dt} = f(t,\, \mathbf{h};\, \boldsymbol{\theta}). $$

This is the Neural ODE. The discrete layer index $l$ becomes continuous time $t$. Separate per-layer parameters $\boldsymbol{\theta}_l$ collapse into a single shared $\boldsymbol{\theta}$.
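The correspondence can be checked directly. In the sketch below (a hypothetical toy dynamics function, not the article's network), a weight-tied ResNet with $L$ blocks of step $1/L$ computes exactly the same thing as forward Euler on $[0, 1]$:

```python
import numpy as np

rng = np.random.default_rng(0)
W = 0.1 * rng.normal(size=(4, 4))  # toy shared parameters theta (illustrative)

def f(t, h):
    """Shared dynamics f(t, h; theta) -- a single tanh layer as a stand-in."""
    return np.tanh(h @ W)

def resnet(h, L):
    """Weight-tied ResNet: h <- h + (1/L) * f(l/L, h) for l = 0..L-1."""
    for l in range(L):
        h = h + (1.0 / L) * f(l / L, h)
    return h

def euler(h, L):
    """Forward Euler on [0, 1] with dt = 1/L -- the identical update rule."""
    dt = 1.0 / L
    for l in range(L):
        h = h + dt * f(l * dt, h)
    return h

x = rng.normal(size=4)
```

Both functions produce the same output for any $L$; increasing $L$ drives them toward the continuous-time solution.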

Consequences:

  1. Parameter efficiency: One network $f$ replaces $L$ separate blocks. In our experiments, this gives a $19.3\times$ parameter reduction.
  2. Adaptive depth: We can use fewer or more solver steps at test time without retraining.
  3. Invertibility: ODE flows are inherently invertible (just integrate backward in time), enabling normalizing flows.
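Invertibility is also easy to verify numerically: integrate forward from $t = 0$ to $t = 1$, then feed the result back into the same solver with the time interval reversed. A sketch with a toy dynamics function (illustrative, not the article's network):

```python
import numpy as np

W = 0.1 * np.random.default_rng(1).normal(size=(3, 3))  # toy dynamics (illustrative)
f = lambda t, y: np.tanh(y @ W)

def rk4_solve(f, y, t0, t1, n):
    """RK4 integration; works for t1 < t0 too, since dt is then negative."""
    dt = (t1 - t0) / n
    t = t0
    for _ in range(n):
        k1 = f(t, y)
        k2 = f(t + dt / 2, y + dt / 2 * k1)
        k3 = f(t + dt / 2, y + dt / 2 * k2)
        k4 = f(t + dt, y + dt * k3)
        y = y + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        t += dt
    return y

y0 = np.array([1.0, -0.5, 0.3])
y1 = rk4_solve(f, y0, 0.0, 1.0, 100)      # forward: x -> y(1)
y0_rec = rk4_solve(f, y1, 1.0, 0.0, 100)  # backward: y(1) -> y(0)
```

Up to solver error, the round trip recovers the initial state exactly.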

The Adjoint Method

Training a Neural ODE requires computing gradients $\frac{\partial \mathcal{L}}{\partial \boldsymbol{\theta}}$. The naive approach -- backpropagating through all solver steps -- stores the entire forward trajectory, using $O(L)$ memory for $L$ solver steps.

The adjoint method avoids this by solving a second ODE backward in time.

Define the adjoint state:

$$ \mathbf{a}(t) = \frac{\partial \mathcal{L}}{\partial \mathbf{y}(t)}. $$

It satisfies the adjoint ODE:

$$ \frac{d\mathbf{a}}{dt} = -\mathbf{a}(t)^\top \frac{\partial f}{\partial \mathbf{y}}, $$

which is integrated backward from $t = T$ to $t = 0$, with terminal condition $\mathbf{a}(T) = \frac{\partial \mathcal{L}}{\partial \mathbf{y}(T)}$.

The parameter gradients accumulate as:

$$ \frac{\partial \mathcal{L}}{\partial \boldsymbol{\theta}} = -\int_T^0 \mathbf{a}(t)^\top \frac{\partial f}{\partial \boldsymbol{\theta}} \, dt. $$

Memory cost: $O(1)$ in depth -- we only need the current state, not the full trajectory. This is the key advantage that makes Neural ODEs practical for very deep (many-step) integration.

Compute cost: We must re-solve the forward ODE (or store checkpoints) to evaluate $\frac{\partial f}{\partial \mathbf{y}}$ at each backward step. This trades memory for computation.
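The whole scheme can be checked end to end on a scalar problem where the gradient is known in closed form. Below is a sketch (our own toy setup, not from the original): dynamics $f(y) = \theta y$ and loss $\mathcal{L} = y(T)$, so $\partial\mathcal{L}/\partial\theta = y_0 T e^{\theta T}$. The backward pass integrates the augmented state $[\,y, \mathbf{a}, g\,]$ from $T$ to $0$, re-solving $y$ on the fly instead of storing the trajectory:

```python
import numpy as np

theta, y0, T = 0.7, 1.3, 1.0

def f(t, y):
    return theta * y  # dynamics: dy/dt = theta * y

def aug_dynamics(t, s):
    """Augmented state s = [y, a, g]: y is re-solved backward, a is the
    adjoint (da/dt = -a * df/dy), and g accumulates -a * df/dtheta."""
    y, a, g = s
    df_dy = theta
    df_dtheta = y
    return np.array([f(t, y), -a * df_dy, -a * df_dtheta])

def rk4_solve(f, y, t0, t1, n):
    dt = (t1 - t0) / n
    t = t0
    for _ in range(n):
        k1 = f(t, y)
        k2 = f(t + dt / 2, y + dt / 2 * k1)
        k3 = f(t + dt / 2, y + dt / 2 * k2)
        k4 = f(t + dt, y + dt * k3)
        y = y + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        t += dt
    return y

yT = rk4_solve(f, y0, 0.0, T, 100)          # forward pass: y(T)
aT = 1.0                                    # dL/dy(T) for L = y(T)
s0 = rk4_solve(aug_dynamics, np.array([yT, aT, 0.0]), T, 0.0, 100)
grad_adjoint = s0[2]                        # g(0) = dL/dtheta

grad_analytic = y0 * T * np.exp(theta * T)  # d/dtheta [y0 * e^(theta*T)]
```

Only the three-component augmented state is carried through the backward solve, which is the $O(1)$-memory claim in miniature.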

Summary

We have covered the mathematical core of Neural ODEs: initial value problems and their well-posedness, fixed-step solvers (forward Euler and RK4), the view of ResNets as Euler discretizations, and the adjoint method for memory-efficient gradients. In Part 2, we implement all of this from scratch in pure PyTorch: ODE solvers, the dynamics network, the Neural ODE layer, and a classifier for spiral data.