Deconstructing Neural ODEs from Scratch

Part 3: Infinite Layers, Finite Memory

Introduction

In Part 1 we built the math; in Part 2 we wrote the code. Now we train Neural ODEs on the classic spiral classification benchmark and dissect every result.

Our experiments compare three models: Neural ODE (Euler) with 20 Euler integration steps, Neural ODE (RK4) with 20 RK4 integration steps, and a ResNet MLP with 20 discrete residual blocks. All share the same architecture skeleton: lift 2D input to 6D, apply 20 "layers" of dynamics, then classify. The only difference is whether those layers share weights (ODE) or have independent weights (ResNet).

The Spiral Dataset

We generate 1,000 points from two interleaving Archimedean spirals:

$$ r = \frac{\theta}{3\pi}, \qquad x = r\cos\theta + \varepsilon, \qquad y = r\sin\theta + \varepsilon, $$

where $\theta \in [0, 3\pi]$ (1.5 full turns) and $\varepsilon \sim \mathcal{N}(0, 0.1)$. The second spiral is rotated by $\pi$, creating two interleaving classes.
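The generation procedure fits in a few lines of NumPy. This is an illustrative sketch (the function name and seed handling are ours, not necessarily the repository's exact code):

```python
import numpy as np

def make_spirals(n_points=1000, noise=0.1, seed=0):
    """Two interleaving Archimedean spirals: r = theta / (3*pi),
    theta in [0, 3*pi], with the second spiral rotated by pi."""
    rng = np.random.default_rng(seed)
    n = n_points // 2
    theta = rng.uniform(0.0, 3 * np.pi, size=n)
    r = theta / (3 * np.pi)
    X, y = [], []
    for cls, offset in enumerate((0.0, np.pi)):   # class 1 is the rotated copy
        x1 = r * np.cos(theta + offset) + rng.normal(0.0, noise, n)
        x2 = r * np.sin(theta + offset) + rng.normal(0.0, noise, n)
        X.append(np.stack([x1, x2], axis=1))
        y.append(np.full(n, cls))
    return np.concatenate(X), np.concatenate(y)
```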

This is a canonical nonlinear classification benchmark. The decision boundary is neither linear nor convex -- it is a continuous curve that spirals around the origin. Any model that solves it must learn a highly nonlinear transformation.

Results

Training configuration: 500 epochs, Adam optimizer with initial learning rate 0.01, cosine annealing schedule, batch size 128.

| Model | Accuracy | Parameters | Time | Total NFE |
|---|---|---|---|---|
| Neural ODE (Euler) | 89.4% | 18,906 | 23.6s | 80,000 |
| Neural ODE (RK4) | 90.9% | 18,906 | 91.5s | 320,000 |
| ResNet MLP | 92.9% | 364,236 | 35.3s | -- |

Key Takeaways

Parameter efficiency. The Neural ODE uses $19.3\times$ fewer parameters than the ResNet MLP (18,906 vs. 364,236) while giving up only 2.0 (RK4) to 3.5 (Euler) percentage points of accuracy. This is the direct consequence of weight sharing: one function $f(\cdot;\boldsymbol{\theta})$ evaluated 20 times vs. 20 independent blocks.

Euler vs. RK4. RK4 gains 1.5 percentage points over Euler (90.9% vs. 89.4%), but at a significant computational cost: $3.9\times$ the training time (91.5s vs. 23.6s) and $4\times$ the function evaluations (320,000 vs. 80,000 NFE).

The advantage of RK4 is numerical precision: with the same number of steps, it solves the ODE more accurately. This matters more for stiff dynamics or when using fewer steps.
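The two update rules make the cost gap concrete: one evaluation of $f$ per Euler step versus four per RK4 step. A generic, library-free sketch:

```python
def euler_step(f, y, t, dt):
    """One forward-Euler step: 1 evaluation of f."""
    return y + dt * f(t, y)

def rk4_step(f, y, t, dt):
    """One classic RK4 step: 4 evaluations of f (k1 through k4)."""
    k1 = f(t, y)
    k2 = f(t + dt / 2, y + dt * k1 / 2)
    k3 = f(t + dt / 2, y + dt * k2 / 2)
    k4 = f(t + dt, y + dt * k3)
    return y + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6
```

On a test problem like $dy/dt = y$, 20 RK4 steps land far closer to the exact solution $e$ than 20 Euler steps do, which is the precision advantage described above.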

ResNet advantage. The ResNet MLP reaches 92.9% -- 2 points above the best Neural ODE. This is expected: independent weights per block provide strictly more representational capacity. The Neural ODE's constraint (all "layers" share the same function) limits what dynamics it can express. The question is whether that 2% gap is worth $19\times$ more parameters.

Trajectory Analysis

The most illuminating visualization is the ODE trajectory plot. For each input point $\mathbf{x} \in \mathbb{R}^2$, we lift to $\mathbb{R}^6$ and plot the full ODE trajectory from $t=0$ to $t=1$, projected onto the first two principal components.
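The projection step is a plain SVD on the pooled trajectory points. A minimal sketch, assuming trajectories are stored as a `(steps, batch, dim)` array (the function name is ours):

```python
import numpy as np

def pca_project(trajectories, n_components=2):
    """Project ODE trajectories of shape (steps, batch, dim)
    onto their top principal components."""
    T, B, D = trajectories.shape
    flat = trajectories.reshape(-1, D)
    flat = flat - flat.mean(axis=0)          # center before PCA
    # rows of Vt are principal directions of the pooled points
    _, _, Vt = np.linalg.svd(flat, full_matrices=False)
    return (flat @ Vt[:n_components].T).reshape(T, B, n_components)
```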

What we observe: trajectories from the two classes start interleaved at $t=0$ and flow apart smoothly, ending in a configuration where a single hyperplane can separate them.

This is the geometric interpretation of a Neural ODE: it learns a continuous deformation of the feature space that untangles the data. Unlike a discrete ResNet, which applies a sequence of abrupt transformations, the Neural ODE flows the data smoothly from an entangled configuration to a linearly separable one.

The decision boundary plots confirm this: the Neural ODE produces smooth, continuous boundaries that follow the spiral structure, while the ResNet MLP (with more parameters) achieves a slightly tighter fit.

Memory Comparison

| Model | Parameters | Peak Memory |
|---|---|---|
| ODE (Euler) | 18,906 | 11.5 KB |
| ODE (RK4) | 18,906 | 30.2 KB |
| ResNet MLP | 364,236 | 6.1 KB |

Two observations:

RK4 uses $2.6\times$ more memory than Euler. Each RK4 step creates four intermediate tensors ($\mathbf{k}_1$ through $\mathbf{k}_4$) that must be retained for backpropagation. Euler creates only one.

Parameter storage vs. activation memory. The ResNet MLP stores 364,236 parameters ($19.3\times$ more), but its forward pass peak memory is actually lower because each block is independent and gradients can be computed locally. The Neural ODE's unrolled solver creates a long computation graph that PyTorch must retain.

This is precisely the problem the adjoint method solves. By treating the backward pass as another ODE (solved in reverse time), the adjoint method reduces activation memory from $O(N)$ to $O(1)$ in the number of solver steps, at the cost of recomputing the forward trajectory.
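A rough PyTorch sketch of the idea for our Euler solver follows. This is the continuous adjoint: the forward pass retains no graph, and the backward pass re-evaluates $f$ step by step while reconstructing the state backward in time; its gradients match discrete backpropagation only up to $O(\Delta t)$ discretization error.

```python
import torch

def euler_forward_no_graph(f, y0, t0, t1, n_steps):
    """Forward Euler under no_grad: no graph retained, O(1) activation memory."""
    dt = (t1 - t0) / n_steps
    y = y0.detach()
    with torch.no_grad():
        for i in range(n_steps):
            y = y + dt * f(t0 + i * dt, y)
    return y

def adjoint_backward(f, yT, grad_yT, t0, t1, n_steps, params):
    """Integrate the adjoint ODE da/dt = -a df/dy backward in time,
    recomputing f at each step instead of storing activations."""
    dt = (t1 - t0) / n_steps
    a, y = grad_yT.clone(), yT.clone()
    grads = [torch.zeros_like(p) for p in params]
    for i in reversed(range(n_steps)):
        t = t0 + i * dt
        y_req = y.detach().requires_grad_(True)
        with torch.enable_grad():
            fy = f(t, y_req)
            vjps = torch.autograd.grad(fy, [y_req, *params], a, allow_unused=True)
        for g, v in zip(grads, vjps[1:]):   # accumulate parameter gradients
            if v is not None:
                g += dt * v
        a = a + dt * vjps[0]                # adjoint state update
        y = y - dt * fy.detach()            # approximate backward-in-time state
    return a, grads
```

The memory saving is visible in the structure: nothing from the forward loop is kept, and the backward loop only ever holds one step's worth of intermediate tensors.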

Connections to Other Architectures

ResNets as Euler Discretizations

Our results make the connection concrete. The ResNet MLP with step scaling `h = h + (1/n_blocks) * block(h)` is literally a forward Euler solver with $\Delta t = 1/20$. The difference: each "step" has its own weights, whereas the Neural ODE reuses one set.
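The equivalence is easy to check numerically: if every ResNet block is literally the same function, the two forward passes coincide exactly. A toy sketch, with `blocks` and `f` standing in for the real networks:

```python
def resnet_forward(blocks, h):
    """ResNet with step scaling: each block has its own weights."""
    n = len(blocks)
    for block in blocks:
        h = h + (1.0 / n) * block(h)
    return h

def node_euler_forward(f, h, n_steps):
    """Euler-discretized Neural ODE: one shared f, evaluated n_steps times."""
    dt = 1.0 / n_steps
    for _ in range(n_steps):
        h = h + dt * f(h)
    return h
```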

Liquid Neural Networks

Liquid Neural Networks (Hasani et al., 2021) extend the Neural ODE idea with state-dependent time constants:

$$ \frac{d\mathbf{h}}{dt} = -\frac{\mathbf{h}}{\tau(\mathbf{h})} + f(\mathbf{h}, \mathbf{x}; \boldsymbol{\theta}). $$

This adds a natural damping/memory term that Neural ODEs lack. The ODE solvers we built here would work for Liquid Neural Networks with a modified dynamics function.
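A minimal sketch of such a dynamics function, usable with the solvers above. The shapes and the softplus parameterization of $\tau$ are illustrative assumptions; the only real requirement is that $\tau(\mathbf{h})$ stays positive:

```python
import numpy as np

def liquid_dynamics(h, x, W_tau, W_f, b_f):
    """dh/dt = -h / tau(h) + f(h, x) with a state-dependent, positive tau."""
    tau = np.log1p(np.exp(h @ W_tau)) + 0.1   # softplus keeps time constants > 0
    drive = np.tanh(np.concatenate([h, x]) @ W_f + b_f)
    return -h / tau + drive
```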

Normalizing Flows

Since ODE solutions are invertible (integrate forward to encode, backward to decode), Neural ODEs form the basis of continuous normalizing flows (CNFs). The change of variables formula for an ODE flow involves the trace of the Jacobian:

$$ \log p(\mathbf{x}) = \log p(\mathbf{z}) - \int_0^T \mathrm{tr}\!\left(\frac{\partial f}{\partial \mathbf{y}}\right) dt, $$

where $\mathbf{z} = \mathbf{y}(T)$. This avoids the triangular Jacobian constraints of discrete normalizing flows.
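The trace term can be computed exactly with autograd, one Jacobian diagonal entry at a time. A sketch (in high dimensions, CNFs typically replace this loop with a stochastic trace estimator):

```python
import torch

def trace_df_dy(f, t, y):
    """Exact trace of df/dy, per batch element, via autograd."""
    y = y.detach().requires_grad_(True)
    fy = f(t, y)
    tr = torch.zeros(y.shape[0])
    for i in range(y.shape[-1]):
        # row i of the Jacobian for every sample; keep its i-th entry
        row = torch.autograd.grad(fy[:, i].sum(), y, retain_graph=True)[0]
        tr = tr + row[:, i]
    return tr
```

Accumulating `-dt * trace_df_dy(f, t, y)` alongside the state in the Euler loop yields the integral in the change-of-variables formula.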

Lessons Learned

  1. Dimensional lifting is essential. A 2D ODE cannot separate spirals due to the non-crossing property. Lifting to 6D provides the extra degrees of freedom the flow needs.
  2. Zero initialization matters. Starting the dynamics at $d\mathbf{y}/dt \approx 0$ (identity transformation) dramatically helps optimization. Without it, the random initial dynamics can push states to extreme values, causing solver instability.
  3. Weight sharing is a double-edged sword. It gives $19\times$ parameter efficiency but limits expressivity. The Neural ODE cannot learn different transformations at different depths -- only one shared dynamics function evaluated at different times.
  4. Solver choice trades accuracy for compute. RK4 is more accurate per step but costs $4\times$ the computation. For many practical problems, Euler with more steps may be preferable to RK4 with fewer.
  5. The ResNet connection is not just an analogy. It is a precise mathematical relationship. Understanding it helps transfer intuitions between discrete and continuous architectures.
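Lesson 2 amounts to zero-initializing the final layer of the dynamics network. A sketch of the trick, assuming a small MLP like ours (not necessarily the repository's exact architecture):

```python
import torch
import torch.nn as nn

def make_dynamics(dim=6, hidden=64):
    """Dynamics net whose final layer starts at zero, so f(y) = 0 at init
    and the initial ODE flow is exactly the identity map."""
    net = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, dim))
    nn.init.zeros_(net[-1].weight)
    nn.init.zeros_(net[-1].bias)
    return net
```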

Conclusion

We built Neural ODEs from scratch: Euler and RK4 solvers, a learnable dynamics network, dimensional lifting, and a classification pipeline. On the spiral benchmark, the Neural ODEs came within 2.0--3.5 percentage points of the ResNet MLP's accuracy while using $19.3\times$ fewer parameters, with RK4 buying a small accuracy gain over Euler at $4\times$ the compute.

Neural ODEs are not just a theoretical curiosity. They offer a principled framework for parameter-efficient deep learning, invertible transformations, and adaptive-depth computation. The mathematics of differential equations -- developed over centuries -- provides tools, intuitions, and guarantees that discrete architectures lack.

Every line of code is available in the repository. No black boxes.