Deconstructing Neural ODEs: Part 3 - Infinite Layers, Finite Memory

Introduction

Parts 1 and 2 covered the math and the code. Here we train on the spiral classification benchmark and look at what actually happens.

Three models: Neural ODE (Euler) with 20 Euler steps, Neural ODE (RK4) with 20 RK4 steps, and a ResNet MLP with 20 discrete residual blocks. All share the same skeleton -- lift 2D input to 6D, apply 20 "layers" of dynamics, classify -- differing only in whether those layers share weights (ODE) or have independent weights (ResNet).

The Spiral Dataset

We generate 1,000 points from two interleaving Archimedean spirals:

r = \frac{\theta}{3\pi}, \qquad x = r\cos\theta + \varepsilon_x, \qquad y = r\sin\theta + \varepsilon_y,

where $\theta \in [0, 3\pi]$ (1.5 full turns) and $\varepsilon_x, \varepsilon_y \sim \mathcal{N}(0, 0.1)$ are independent noise draws. The second spiral is rotated by $\pi$, creating two interleaving classes.
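A minimal NumPy sketch of this construction follows. The function name `make_spirals`, the even 500/500 class split, uniform sampling of $\theta$, and reading 0.1 as a standard deviation applied independently to each coordinate are all assumptions for illustration; the series' actual generator may differ.

```python
import numpy as np

def make_spirals(n_points=1000, noise_std=0.1, seed=0):
    """Two interleaving Archimedean spirals; returns (X, labels).

    Assumptions (not confirmed by the text): equal class sizes, theta drawn
    uniformly, and noise_std interpreted as a standard deviation.
    """
    rng = np.random.default_rng(seed)
    n_per_class = n_points // 2
    theta = rng.uniform(0.0, 3.0 * np.pi, size=n_per_class)
    r = theta / (3.0 * np.pi)  # radius grows linearly from 0 to 1
    arm0 = np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)
    arm1 = -arm0  # rotating by pi negates both coordinates
    X = np.concatenate([arm0, arm1], axis=0)
    X = X + rng.normal(0.0, noise_std, size=X.shape)  # independent noise per coordinate
    labels = np.concatenate([
        np.zeros(n_per_class, dtype=int),
        np.ones(n_per_class, dtype=int),
    ])
    return X, labels

X, labels = make_spirals()
```

Setting `noise_std=0.0` returns the clean curves, which is handy for checking that the second arm is the point reflection of the first.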

The two classes are not linearly separable, and neither class region is convex -- the decision boundary spirals around the origin. Any model that solves this needs a highly nonlinear transformation.

Results

Training configuration: 500 epochs, Adam optimizer with initial learning rate 0.01, cosine annealing schedule, batch size 128.

Model	Accuracy	Parameters	Time	Total NFE
Neural ODE (Euler)	89.4%	18,906	23.6s	80,000
Neural ODE (RK4)	90.9%	18,906	91.5s	320,000
ResNet MLP	92.9%	364,236	35.3s	--
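The Total NFE column can be reproduced with simple arithmetic. The sketch below is one reading that matches the table, assuming all 1,000 points are used for training, the last partial batch is kept, and only forward-pass evaluations of the dynamics function are counted; `total_nfe` is a hypothetical helper, not code from the series.

```python
import math

def total_nfe(n_samples, batch_size, epochs, solver_steps, evals_per_step):
    """Forward-pass evaluations of the dynamics function over a full training run."""
    batches_per_epoch = math.ceil(n_samples / batch_size)  # 1000 / 128 -> 8 batches
    return epochs * batches_per_epoch * solver_steps * evals_per_step

euler_nfe = total_nfe(1000, 128, 500, 20, evals_per_step=1)  # Euler: 1 evaluation per step
rk4_nfe = total_nfe(1000, 128, 500, 20, evals_per_step=4)    # RK4: 4 evaluations per step
print(euler_nfe, rk4_nfe)  # 80000 320000
```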

Key Takeaways

Parameter efficiency. 18,906 vs. 364,236 parameters -- a $19.3\times$ gap -- and the Neural ODE still lands within 2.0 points (RK4) to 3.5 points (Euler) of the ResNet. Weight sharing (one $f(\cdot;\boldsymbol{\theta})$ evaluated 20 times vs. 20 independent blocks) accounts for the parameter gap.

Euler vs. RK4. RK4 gains 1.5 points over Euler (90.9% vs. 89.4%), but costs $4\times$ the function evaluations (320k vs. 80k) and $3.9\times$ the wall-clock time (91.5s vs. 23.6s). For this problem, Euler gives better accuracy per function evaluation. RK4's higher order pays off more when the dynamics change quickly relative to the step size or the step count is low; genuinely stiff dynamics usually call for an implicit or adaptive solver instead.
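To make the per-step cost concrete, here is a small pure-Python sketch of the two fixed-step solvers with an evaluation counter. The test problem $dy/dt = -2y$ and the helper names are illustrative choices, not the dynamics or code used in the experiments.

```python
import math

def euler_step(f, y, t, dt):
    return y + dt * f(t, y)  # one evaluation of f

def rk4_step(f, y, t, dt):
    k1 = f(t, y)
    k2 = f(t + dt / 2, y + dt / 2 * k1)
    k3 = f(t + dt / 2, y + dt / 2 * k2)
    k4 = f(t + dt, y + dt * k3)
    return y + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)  # four evaluations of f

def integrate(step, f, y0, n_steps, t1=1.0):
    """Integrate from t=0 to t1 and count how often f is called."""
    nfe = 0

    def counted(t, y):
        nonlocal nfe
        nfe += 1
        return f(t, y)

    y, dt = y0, t1 / n_steps
    for i in range(n_steps):
        y = step(counted, y, i * dt, dt)
    return y, nfe

def f(t, y):
    return -2.0 * y  # illustrative test problem; exact solution y(1) = exp(-2)

exact = math.exp(-2.0)
y_euler, nfe_euler = integrate(euler_step, f, 1.0, 20)
y_rk4, nfe_rk4 = integrate(rk4_step, f, 1.0, 20)
print(nfe_euler, abs(y_euler - exact))
print(nfe_rk4, abs(y_rk4 - exact))
```

On a smooth scalar problem like this one, RK4's integration error is orders of magnitude smaller than Euler's. Integration error and classification accuracy are different metrics, though, which is consistent with the $4\times$ cost buying only 1.5 points on the spiral task.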

ResNet advantage. The ResNet MLP hits 92.9% -- 2 points above the best Neural ODE. Independent weights per block give it more capacity. Whether that 2-point gap justifies $19\times$ more parameters depends on the application.

Trajectory Analysis

For each input point $\mathbf{x} \in \mathbb{R}^2$, we lift to $\mathbb{R}^6$ and plot the ODE trajectory from $t=0$ to $t=1$, projected onto the first two principal components.

At $t=0$ the two classes are interleaved. As $t$ increases, the flow pulls them apart. By $t=1$, they sit in distinct regions. The trajectories are smooth -- no discrete jumps -- which is the geometric point of a Neural ODE: it learns a continuous deformation that untangles the data into a linearly separable configuration.
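The projection step can be sketched with a plain SVD. This assumes the trajectory is stored as an array of shape (time steps, points, 6) and that the principal components are fitted on all time steps pooled together; the text does not say how the PCA was fitted, and `project_trajectories` is a hypothetical helper.

```python
import numpy as np

def project_trajectories(traj, n_components=2):
    """Project ODE states onto their leading principal components.

    traj: array of shape (n_times, n_points, dim).
    Returns an array of shape (n_times, n_points, n_components).
    Assumption: one PCA basis shared across all time steps.
    """
    n_times, n_points, dim = traj.shape
    flat = traj.reshape(-1, dim)
    centered = flat - flat.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:n_components].T
    return projected.reshape(n_times, n_points, n_components)
```

Sharing one basis across time keeps the plotted paths continuous; fitting a separate PCA per time step would let the axes rotate between frames.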

Decision boundary plots show the same story: the Neural ODE produces smooth boundaries that follow the spiral, while the ResNet (with $19\times$ more parameters) fits slightly tighter.

Memory Comparison

Model	Parameters	Peak Memory
ODE (Euler)	18,906	11.5 KB
ODE (RK4)	18,906	30.2 KB
ResNet MLP	364,236	6.1 KB

RK4 uses $2.6\times$ more memory than Euler because each step creates four intermediate tensors ($\mathbf{k}_1$ through $\mathbf{k}_4$) that PyTorch retains for backprop. Euler creates one.

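Parameter storage vs. activation memory. The ResNet stores $19.3\times$ more parameters, but its measured peak activation memory is lower (6.1 KB vs. 11.5 KB for Euler). The two budgets are separate: parameter storage scales with the number of distinct weights, while activation memory scales with how many intermediate tensors the autograd graph retains. The Neural ODE's unrolled solver builds a long computation graph that all stays in memory.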

This is the problem the adjoint method addresses. Treating the backward pass as another ODE (solved in reverse time) cuts activation memory from $O(N)$ to $O(1)$ in solver steps, at the cost of recomputing the forward trajectory.
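A scalar toy problem shows the idea. The sketch below assumes dynamics $dy/dt = \theta y$ and loss $L = \tfrac{1}{2} y(T)^2$ (both chosen for illustration because the gradient has the closed form $T\,y(T)^2$), and uses Euler steps in both directions. Only three scalars are kept during the backward pass, regardless of the number of steps.

```python
import math

def adjoint_gradient(theta, y0=1.0, t1=1.0, n_steps=1000):
    """dL/dtheta for dy/dt = theta * y, L = 0.5 * y(t1)**2, via the adjoint ODE."""
    dt = t1 / n_steps

    # Forward pass: keep only the final state, not the trajectory.
    y = y0
    for _ in range(n_steps):
        y = y + dt * theta * y
    loss = 0.5 * y * y

    # Backward pass from t1 down to 0, carrying (y, a, grad) only.
    a = y        # a(t1) = dL/dy(t1)
    grad = 0.0
    for _ in range(n_steps):
        grad += dt * a * y      # integrand a * df/dtheta, where df/dtheta = y
        a = a + dt * theta * a  # da/dt = -a * df/dy, stepped backward in time
        y = y - dt * theta * y  # recompute y backward instead of storing it
    return loss, grad

theta = 0.5
loss, grad = adjoint_gradient(theta)
exact_grad = 1.0 * math.exp(2.0 * theta)  # T * y(T)^2 with y0 = 1, T = 1
print(grad, exact_grad)
```

Reconstructing $y$ by integrating backward is cheap here, but for general dynamics it can drift from the true forward trajectory, which is one reason practical implementations often add checkpoints.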

Connections to Other Architectures

ResNets as Euler Discretizations

The ResNet MLP with h = h + (1/n_blocks) * block(h) is a forward Euler solver with $\Delta t = 1/20$. The only difference: each step has its own weights, while the Neural ODE reuses one set.
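The identity is easy to check numerically: tie the weights of every residual block and the ResNet forward pass coincides with an Euler solve. The single-layer tanh `block` below is a stand-in chosen for brevity, not the block architecture used in the experiments.

```python
import numpy as np

def block(h, W, b):
    # Hypothetical residual branch: one tanh layer.
    return np.tanh(h @ W + b)

def resnet_forward(h, weights):
    """weights: list of (W, b) pairs, one per residual block."""
    n_blocks = len(weights)
    for W, b in weights:
        h = h + (1.0 / n_blocks) * block(h, W, b)
    return h

def euler_ode_forward(h, W, b, n_steps):
    """Forward Euler on dh/dt = block(h; W, b) from t=0 to t=1."""
    dt = 1.0 / n_steps
    for _ in range(n_steps):
        h = h + dt * block(h, W, b)
    return h

rng = np.random.default_rng(0)
W = 0.5 * rng.normal(size=(6, 6))
b = 0.1 * rng.normal(size=6)
h0 = rng.normal(size=(4, 6))

tied = [(W, b)] * 20  # same weights in every block
same = np.allclose(resnet_forward(h0, tied), euler_ode_forward(h0, W, b, 20))
print(same)  # True
```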

Liquid Neural Networks

Liquid Neural Networks (Hasani et al., 2021) add state-dependent time constants:

\frac{d\mathbf{h}}{dt} = -\frac{\mathbf{h}}{\tau(\mathbf{h})} + f(\mathbf{h}, \mathbf{x}; \boldsymbol{\theta}).

The extra damping/memory term is the main difference. The solvers from Part 2 work here with a modified dynamics function.
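As a rough illustration of what "a modified dynamics function" could look like, the sketch below implements the right-hand side of the equation above. The parameterization of $\tau(\mathbf{h})$ (a floor plus a softplus), the tanh drive term, and every parameter name are assumptions made for this example; they are not taken from Hasani et al. or from Part 2.

```python
import numpy as np

def softplus(z):
    return np.logaddexp(0.0, z)  # numerically stable log(1 + exp(z))

def liquid_dynamics(h, x, params):
    """dh/dt = -h / tau(h) + f(h, x); all parameter names are hypothetical."""
    # State-dependent time constant, kept above tau_min so the decay stays bounded.
    tau = params["tau_min"] + softplus(h @ params["W_tau"] + params["b_tau"])
    # Input-driven term f(h, x; theta).
    drive = np.tanh(h @ params["W_h"] + x @ params["W_x"] + params["b"])
    return -h / tau + drive

def euler_solve(h, x, params, n_steps=20):
    dt = 1.0 / n_steps
    for _ in range(n_steps):
        h = h + dt * liquid_dynamics(h, x, params)
    return h
```

With the drive term switched off, the state simply decays toward zero at a rate set by $\tau(\mathbf{h})$, which is the damping behavior the extra term contributes.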

Normalizing Flows

ODE solutions are invertible (integrate forward to encode, backward to decode), so they can serve as continuous normalizing flows (CNFs). The change-of-variables formula for an ODE flow involves the trace of the Jacobian:

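\log p(\mathbf{x}) = \log p(\mathbf{z}) + \int_0^T \mathrm{tr}\!\left(\frac{\partial f}{\partial \mathbf{y}}\right) dt,

where $\mathbf{x} = \mathbf{y}(0)$ is the data point and $\mathbf{z} = \mathbf{y}(T)$ is its encoding. The sign follows from $\frac{d}{dt}\log p(\mathbf{y}(t)) = -\mathrm{tr}(\partial f/\partial \mathbf{y})$: integrating from $0$ to $T$ gives $\log p(\mathbf{z}) = \log p(\mathbf{x}) - \int_0^T \mathrm{tr}(\partial f/\partial \mathbf{y})\,dt$. This sidesteps the triangular Jacobian constraints that many discrete normalizing flows impose to keep the determinant tractable.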
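The trace enters through Liouville's formula, $\log\left|\det \partial \mathbf{y}(T)/\partial \mathbf{y}(0)\right| = \int_0^T \mathrm{tr}(\partial f/\partial \mathbf{y})\,dt$. The sketch below checks this numerically for a linear vector field $f(\mathbf{y}) = A\mathbf{y}$, an illustrative choice where the trace is constant and the integral is just $T\,\mathrm{tr}(A)$.

```python
import numpy as np

def flow_jacobian_linear(A, t1=1.0, n_steps=200):
    """RK4-integrate dJ/dt = A J from J(0) = I, so J(t1) = d y(t1) / d y(0)."""
    J = np.eye(A.shape[0])
    dt = t1 / n_steps
    for _ in range(n_steps):
        k1 = A @ J
        k2 = A @ (J + 0.5 * dt * k1)
        k3 = A @ (J + 0.5 * dt * k2)
        k4 = A @ (J + dt * k3)
        J = J + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
    return J

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))      # illustrative linear dynamics f(y) = A y
J = flow_jacobian_linear(A)
sign, logdet = np.linalg.slogdet(J)
trace_integral = 1.0 * np.trace(A)  # integral of a constant trace over [0, 1]
print(logdet, trace_integral)
```

The two printed values agree to solver precision. For a nonlinear $f$ the trace varies along the trajectory and is accumulated as an extra state in the ODE solve, but no determinant is ever formed.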

Lessons Learned

Dimensional lifting is essential. ODE trajectories cannot cross, so a flow confined to 2D has very little room to untangle the interleaved spirals. Lifting to 6D gives the flow room to work.
Zero initialization matters. Starting at $d\mathbf{y}/dt \approx 0$ (identity map) stabilizes early training. Without it, random dynamics push states to extreme values and the solver blows up.
Weight sharing cuts both ways. $19\times$ fewer parameters, but the Neural ODE cannot learn different transformations at different depths -- just one dynamics function evaluated at different times.
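Solver choice trades accuracy for compute. RK4 costs $4\times$ the function evaluations per step. When the task does not need high integration accuracy, as in this classification benchmark, Euler can be the better use of a fixed evaluation budget.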
The ResNet connection is exact. It is a mathematical identity, not a metaphor, and it lets you transfer intuitions between discrete and continuous architectures.
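The zero-initialization point above can be illustrated in a few lines. The sketch assumes a two-layer tanh dynamics network with a hidden width of 32 and assumes that "zero initialization" means zeroing the final layer; both are guesses made for illustration, since this part does not restate the architecture.

```python
import numpy as np

def init_dynamics(dim=6, hidden=32, seed=0, zero_last=True):
    """Hypothetical two-layer dynamics net; hidden width is an assumption."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(size=(dim, hidden)) / np.sqrt(dim)
    b1 = np.zeros(hidden)
    # Zeroing the last layer makes dy/dt = 0 everywhere at initialization.
    W2 = np.zeros((hidden, dim)) if zero_last else rng.normal(size=(hidden, dim))
    b2 = np.zeros(dim)
    return W1, b1, W2, b2

def dynamics(y, params):
    W1, b1, W2, b2 = params
    return np.tanh(y @ W1 + b1) @ W2 + b2

def euler_flow(y, params, n_steps=20):
    dt = 1.0 / n_steps
    for _ in range(n_steps):
        y = y + dt * dynamics(y, params)
    return y

y0 = np.random.default_rng(1).normal(size=(5, 6))
drift_zero = np.abs(euler_flow(y0, init_dynamics(zero_last=True)) - y0).max()
drift_rand = np.abs(euler_flow(y0, init_dynamics(zero_last=False)) - y0).max()
print(drift_zero, drift_rand)
```

With the last layer zeroed the flow is exactly the identity map at the start of training; with a random last layer the states move by an amount that grows with the weight scale.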

Conclusion

From-scratch Neural ODEs on the spiral benchmark:

Neural ODE (RK4): 90.9% accuracy, 18,906 parameters.
ResNet MLP: 92.9% accuracy, 364,236 parameters ($19.3\times$ more).
RK4 beats Euler by 1.5 points at $4\times$ the compute.

The practical upshot: Neural ODEs trade a small accuracy gap for large parameter savings, invertibility, and adaptive depth. The ODE formulation also brings along a century of numerical analysis tooling -- error bounds, adaptive step-size control, stiffness detection -- that discrete architectures have to reinvent.
