What if a neural network had no fixed depth? What if, instead of stacking a finite number of layers, we described the transformation of data as a continuous process governed by a differential equation?
This is the central idea of Neural Ordinary Differential Equations (Neural ODEs), introduced by Chen et al. (NeurIPS 2018). They replace discrete residual blocks with a continuous dynamical system:

$$\frac{d\mathbf{y}(t)}{dt} = f(t, \mathbf{y}(t); \boldsymbol{\theta}), \qquad \mathbf{y}(0) = \mathbf{x},$$
where $f$ is a neural network parameterized by $\boldsymbol{\theta}$, and the "output" of the network is the state at $t=1$: $\hat{\mathbf{y}} = \mathbf{y}(1)$.
In this first part of a three-part series, we develop the mathematical foundations: initial value problems, numerical solvers, and the adjoint method for memory-efficient training.
ODEs and Initial Value Problems
An ordinary differential equation (ODE) describes how a quantity $\mathbf{y}(t) \in \mathbb{R}^d$ evolves over time:

$$\frac{d\mathbf{y}(t)}{dt} = f(t, \mathbf{y}(t)).$$
Given an initial condition $\mathbf{y}(t_0) = \mathbf{y}_0$, this is called an initial value problem (IVP). Under mild regularity conditions on $f$ (Lipschitz continuity), the Picard-Lindelöf theorem guarantees that a unique solution exists.
In the neural network context:
- $\mathbf{y}(t)$ is the hidden state at "time" (depth) $t$.
- $f$ is the learned dynamics -- a neural network.
- $\mathbf{y}_0 = \mathbf{x}$ is the input data.
- $\mathbf{y}(T)$ is the output, with $T$ typically set to 1.
A critical property of ODE solutions is that trajectories cannot cross. In $\mathbb{R}^d$, two solution curves starting at different initial conditions will never intersect: if they touched at some time, uniqueness of solutions would force them to coincide everywhere. This has deep implications for what Neural ODEs can and cannot represent -- and is the reason why augmenting the state dimension is essential for complex tasks.
Euler and Runge-Kutta Methods
Since we rarely have closed-form solutions, we solve the IVP numerically using fixed-step methods.
Forward Euler Method
The simplest approach: approximate the derivative with a forward difference and step explicitly,

$$\mathbf{y}_{n+1} = \mathbf{y}_n + \Delta t \, f(t_n, \mathbf{y}_n).$$
This is a first-order method: the local truncation error is $O(\Delta t^2)$ and the global error is $O(\Delta t)$.
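A minimal sketch of forward Euler in plain Python (function and argument names here are illustrative, not from any library):

```python
import math

def euler_solve(f, y0, t0, t1, n_steps):
    """Integrate dy/dt = f(t, y) from t0 to t1 with fixed Euler steps."""
    dt = (t1 - t0) / n_steps
    t, y = t0, y0
    for _ in range(n_steps):
        y = y + dt * f(t, y)   # y_{n+1} = y_n + dt * f(t_n, y_n)
        t += dt
    return y

# Sanity check on dy/dt = -y, y(0) = 1, whose exact solution is exp(-t).
approx = euler_solve(lambda t, y: -y, 1.0, 0.0, 1.0, 100)
exact = math.exp(-1.0)
```

Halving the step size roughly halves the error, as expected for a method with global error $O(\Delta t)$.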
Key observation: Compare this to a residual network block:

$$\mathbf{h}_{l+1} = \mathbf{h}_l + f_l(\mathbf{h}_l).$$
A ResNet is literally a forward Euler discretization of an ODE, with $\Delta t = 1$ and each block $f_l$ having its own parameters.
Classical Runge-Kutta (RK4)
A fourth-order method using four intermediate evaluations per step:

$$\begin{aligned}
\mathbf{k}_1 &= f(t_n, \mathbf{y}_n) \\
\mathbf{k}_2 &= f\!\left(t_n + \tfrac{\Delta t}{2},\; \mathbf{y}_n + \tfrac{\Delta t}{2}\mathbf{k}_1\right) \\
\mathbf{k}_3 &= f\!\left(t_n + \tfrac{\Delta t}{2},\; \mathbf{y}_n + \tfrac{\Delta t}{2}\mathbf{k}_2\right) \\
\mathbf{k}_4 &= f\!\left(t_n + \Delta t,\; \mathbf{y}_n + \Delta t\,\mathbf{k}_3\right) \\
\mathbf{y}_{n+1} &= \mathbf{y}_n + \tfrac{\Delta t}{6}\left(\mathbf{k}_1 + 2\mathbf{k}_2 + 2\mathbf{k}_3 + \mathbf{k}_4\right)
\end{aligned}$$
The global error is $O(\Delta t^4)$ -- dramatically better accuracy for the same step count, at the cost of four function evaluations per step instead of one.
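The RK4 update translates directly into code. A sketch in plain Python (names illustrative):

```python
import math

def rk4_solve(f, y0, t0, t1, n_steps):
    """Integrate dy/dt = f(t, y) from t0 to t1 with fixed RK4 steps."""
    dt = (t1 - t0) / n_steps
    t, y = t0, y0
    for _ in range(n_steps):
        k1 = f(t, y)
        k2 = f(t + dt / 2, y + dt / 2 * k1)
        k3 = f(t + dt / 2, y + dt / 2 * k2)
        k4 = f(t + dt, y + dt * k3)
        y = y + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        t += dt
    return y

# On dy/dt = -y with only 20 steps, RK4 is already accurate to ~1e-7,
# far better than Euler at the same step count.
approx = rk4_solve(lambda t, y: -y, 1.0, 0.0, 1.0, 20)
```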
Trade-off
For Neural ODEs, each "function evaluation" means a forward pass through the dynamics network $f$, so the number of function evaluations (NFE) is the natural cost metric:
- Euler: 1 NFE/step $\times$ 20 steps = 20 NFE per forward pass
- RK4: 4 NFE/step $\times$ 20 steps = 80 NFE per forward pass
Whether the accuracy improvement justifies the $4\times$ compute cost depends on the problem.
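NFE accounting can be made concrete by wrapping the dynamics function in a counter. A self-contained sketch (the solver loops from above are repeated here; all names are illustrative):

```python
class CountingDynamics:
    """Wrap a dynamics function f(t, y) and count evaluations (NFE)."""
    def __init__(self, f):
        self.f = f
        self.nfe = 0
    def __call__(self, t, y):
        self.nfe += 1
        return self.f(t, y)

def euler_solve(f, y0, t0, t1, n_steps):
    dt = (t1 - t0) / n_steps
    t, y = t0, y0
    for _ in range(n_steps):
        y = y + dt * f(t, y)       # 1 evaluation per step
        t += dt
    return y

def rk4_solve(f, y0, t0, t1, n_steps):
    dt = (t1 - t0) / n_steps
    t, y = t0, y0
    for _ in range(n_steps):
        k1 = f(t, y)               # 4 evaluations per step
        k2 = f(t + dt / 2, y + dt / 2 * k1)
        k3 = f(t + dt / 2, y + dt / 2 * k2)
        k4 = f(t + dt, y + dt * k3)
        y = y + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        t += dt
    return y

g = CountingDynamics(lambda t, y: -y)
euler_solve(g, 1.0, 0.0, 1.0, 20)
euler_nfe = g.nfe                  # 20 NFE

g = CountingDynamics(lambda t, y: -y)
rk4_solve(g, 1.0, 0.0, 1.0, 20)
rk4_nfe = g.nfe                    # 80 NFE
```

Adaptive solvers report the same statistic, which is why NFE (rather than wall-clock time) is the usual cost axis in Neural ODE papers.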
From ResNets to Continuous Depth
The connection between residual networks and ODEs can be made precise.
A standard ResNet computes:

$$\mathbf{h}_{l+1} = \mathbf{h}_l + f_l(\mathbf{h}_l; \boldsymbol{\theta}_l), \qquad l = 0, 1, \dots, L-1.$$
Each block $f_l$ has its own parameters $\boldsymbol{\theta}_l$. Now consider what happens as $L \to \infty$ and each block becomes infinitesimally small: the discrete update tends to

$$\frac{d\mathbf{h}(t)}{dt} = f(t, \mathbf{h}(t); \boldsymbol{\theta}).$$
This is the Neural ODE. The discrete layer index $l$ becomes continuous time $t$. Separate per-layer parameters $\boldsymbol{\theta}_l$ collapse into a single shared $\boldsymbol{\theta}$.
Consequences:
- Parameter efficiency: One network $f$ replaces $L$ separate blocks. In our experiments, this gives a $19.3\times$ parameter reduction.
- Adaptive depth: We can use fewer or more solver steps at test time without retraining.
- Invertibility: ODE flows are inherently invertible (just integrate backward in time), enabling normalizing flows.
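The invertibility point can be checked numerically: integrate forward with RK4, then integrate the same dynamics from $t=1$ back to $t=0$ (a negative step size) and recover the input. A sketch, assuming the fixed-step RK4 loop from earlier (names illustrative):

```python
import math

def rk4_solve(f, y0, t0, t1, n_steps):
    """Fixed-step RK4; t1 < t0 gives backward-in-time integration."""
    dt = (t1 - t0) / n_steps
    t, y = t0, y0
    for _ in range(n_steps):
        k1 = f(t, y)
        k2 = f(t + dt / 2, y + dt / 2 * k1)
        k3 = f(t + dt / 2, y + dt / 2 * k2)
        k4 = f(t + dt, y + dt * k3)
        y = y + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        t += dt
    return y

f = lambda t, y: math.sin(y)           # some smooth nonlinear dynamics
x = 0.7                                # "input"
z = rk4_solve(f, x, 0.0, 1.0, 20)      # forward flow:  x -> z
x_rec = rk4_solve(f, z, 1.0, 0.0, 20)  # backward flow: z -> x (up to solver error)
```

The reconstruction error is limited only by the solver's accuracy, not by the map itself, which is what makes continuous flows attractive for normalizing flows.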
The Adjoint Method
Training a Neural ODE requires computing gradients $\frac{\partial \mathcal{L}}{\partial \boldsymbol{\theta}}$. The naive approach (backpropagate through all solver steps) stores the entire forward trajectory, using $O(L)$ memory.
The adjoint method avoids this by solving a second ODE backward in time.
Define the adjoint state as the gradient of the loss with respect to the state:

$$\mathbf{a}(t) = \frac{\partial \mathcal{L}}{\partial \mathbf{y}(t)}.$$
It satisfies the adjoint ODE:

$$\frac{d\mathbf{a}(t)}{dt} = -\mathbf{a}(t)^\top \frac{\partial f(t, \mathbf{y}(t); \boldsymbol{\theta})}{\partial \mathbf{y}},$$
which is integrated backward from $t = T$ to $t = 0$, with terminal condition $\mathbf{a}(T) = \frac{\partial \mathcal{L}}{\partial \mathbf{y}(T)}$.
The parameter gradients accumulate as:

$$\frac{\partial \mathcal{L}}{\partial \boldsymbol{\theta}} = -\int_T^0 \mathbf{a}(t)^\top \frac{\partial f(t, \mathbf{y}(t); \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \, dt.$$
Memory cost: $O(1)$ in depth -- we only need the current state, not the full trajectory. This is the key advantage that makes Neural ODEs practical for very deep (many-step) integration.
Compute cost: We must re-solve the forward ODE (or store checkpoints) to evaluate $\frac{\partial f}{\partial \mathbf{y}}$ at each backward step. This trades memory for computation.
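The whole procedure can be verified on a toy problem where everything is known in closed form. The sketch below (all names illustrative) uses the scalar ODE $dy/dt = \theta y$ with loss $\mathcal{L} = y(T)$, so the analytic gradient is $\partial \mathcal{L} / \partial \theta = y_0 T e^{\theta T}$. The backward pass integrates the augmented state $[y, a, g]$ from $T$ to $0$, re-solving the forward ODE alongside the adjoint rather than storing the trajectory:

```python
import math

theta, y0, T = 0.5, 2.0, 1.0       # toy dynamics dy/dt = theta * y

def rk4_step(f, t, s, dt):
    """One RK4 step for a list-valued state s."""
    k1 = f(t, s)
    k2 = f(t + dt / 2, [si + dt / 2 * ki for si, ki in zip(s, k1)])
    k3 = f(t + dt / 2, [si + dt / 2 * ki for si, ki in zip(s, k2)])
    k4 = f(t + dt, [si + dt * ki for si, ki in zip(s, k3)])
    return [si + dt / 6 * (a + 2 * b + 2 * c + d)
            for si, a, b, c, d in zip(s, k1, k2, k3, k4)]

# Forward pass: solve for y(T), storing nothing along the way.
n = 50
state, t = [y0], 0.0
for _ in range(n):
    state = rk4_step(lambda t, s: [theta * s[0]], t, state, T / n)
    t += T / n
yT = state[0]                      # loss L = y(T)

# Backward pass: augmented state [y, a, g] from T down to 0, where
#   dy/dt = theta * y            (re-solves the forward ODE)
#   da/dt = -a * df/dy     = -a * theta
#   dg/dt = -a * df/dtheta = -a * y
aug = lambda t, s: [theta * s[0], -s[1] * theta, -s[1] * s[0]]
state, t = [yT, 1.0, 0.0], T       # a(T) = dL/dy(T) = 1, g(T) = 0
for _ in range(n):
    state = rk4_step(aug, t, state, -T / n)
    t -= T / n

grad = state[2]                    # dL/dtheta = g(0)
analytic = y0 * T * math.exp(theta * T)
```

At every instant only the current three numbers are held in memory, which is exactly the $O(1)$-in-depth property described above; the price is the extra forward re-solve inside the backward pass.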
Summary
- Neural ODEs replace discrete residual layers with a continuous ODE: $d\mathbf{y}/dt = f(t, \mathbf{y}; \boldsymbol{\theta})$.
- Euler and RK4 solvers trade off accuracy vs. compute (1 vs. 4 function evaluations per step).
- The ResNet-to-ODE connection is exact: a ResNet is a forward Euler discretization of an ODE, with $\Delta t = 1$ and per-layer parameters.
- The adjoint method enables $O(1)$-memory training by solving a backward ODE instead of backpropagating through all solver steps.
In Part 2, we implement all of this from scratch in pure PyTorch: ODE solvers, the dynamics network, the Neural ODE layer, and a classifier for spiral data.