Neural ODEs on Biological Dynamics: Part 2 - Implementation

Overview

Part 1 argued that biological data lives in continuous time and that Neural ODEs are the right inductive bias for modelling it. Part 2 implements the experimental setup. We build a faithful FitzHugh-Nagumo simulator to generate synthetic trajectories, an autonomous Neural ODE to fit them, and a parameter-matched discrete RNN as a baseline. The interesting design choice — one line of code that decides whether the model extrapolates or not — is dropping $t$ from the ODE function's input.

Total implementation: three files, around $250$ lines including plotting. The Neural ODE machinery itself is about $50$ lines: an ODEFunc, an RK4 integrator, and a wrapper.

File Map

fitzhugh_nagumo.py contains the ground-truth dynamics and dataset builder. neural_ode.py contains the autonomous ODEFunc, the RK4 integrator, and the DiscreteRNN baseline. train.py handles training, phase-portrait visualisation, and the extrapolation test.

The ground-truth simulator is kept separate from the learnable model because they play different roles. The simulator is the oracle — it knows the true equations. The learnable model has to discover those equations from data. Keeping them in separate files makes it clear which is which and prevents accidental coupling (e.g., using the true RHS in the model definition would defeat the experiment).

The Ground-Truth FHN Simulator

A_PARAM, B_PARAM, TAU = 0.7, 0.8, 12.5

def fhn_rhs(state, I):
    v, w = state
    dv = v - (v ** 3) / 3.0 - w + I
    dw = (v + A_PARAM - B_PARAM * w) / TAU
    return np.array([dv, dw])

def rk4_step(rhs, state, I, dt):
    k1 = rhs(state, I)
    k2 = rhs(state + 0.5 * dt * k1, I)
    k3 = rhs(state + 0.5 * dt * k2, I)
    k4 = rhs(state + dt * k3, I)
    return state + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

Six lines for the RHS, six lines for the RK4 integrator. The constants $A_\text{PARAM} = 0.7$, $B_\text{PARAM} = 0.8$, $\tau = 12.5$ are the standard FHN parameters from Nagumo et al. (1962); they produce the canonical spiking dynamics. Other choices give different (but still excitable) behaviour.

Why NumPy for the simulator. The simulator never needs gradients. It generates training data once, before the learnable model is trained. NumPy is faster and simpler than PyTorch for this offline computation. The Neural ODE itself, which needs autograd, uses PyTorch.

Why RK4 instead of Euler for the ground truth. Euler's $O(\Delta t)$ accuracy would introduce systematic error into the training data. RK4 is $O(\Delta t^4)$ — for $\Delta t = 0.5$, RK4 errors are around $10^{-8}$ per step. The learnable model never sees the systematic errors that Euler would introduce, so the experiment cleanly separates "the model learning the true dynamics" from "the model learning the simulator's quirks."

The Dataset Builder

def build_dataset(n_trajectories=40, traj_length=80, dt=0.5):
    states = []
    currents = []
    for _ in range(n_trajectories):
        v0 = np.random.uniform(-2, 2)
        w0 = np.random.uniform(-1, 1)
        I  = np.random.choice([0.2, 0.5])
        traj = [np.array([v0, w0])]
        for _ in range(traj_length):
            traj.append(rk4_step(fhn_rhs, traj[-1], I, dt))
        states.append(np.array(traj))
        currents.append(I)
    return torch.tensor(np.array(states), dtype=torch.float32), \
           torch.tensor(currents, dtype=torch.float32)

$40$ trajectories of $80$ time-steps each at $\Delta t = 0.5$. Half are spiking ($I = 0.5$), half are quiescent ($I = 0.2$). Random initial states cover the relevant region of the phase plane.

The output is a tensor of shape $(40, 81, 2)$ — 40 trajectories, $81$ time-step states ($80$ steps plus the initial state), $2$ state variables $(v, w)$ — and a tensor of shape $(40,)$ for the per-trajectory current values.

$80$ time-steps at $\Delta t = 0.5$ covers about $1.5$ FHN spike cycles in the spiking regime. This is long enough that the model has to learn the limit-cycle structure (not just memorise short segments) but short enough that training is fast. Production work would use longer trajectories.

The Neural ODE Function

This is the most important class in the project. The single design decision encoded here — what goes into the input — controls everything about whether the model extrapolates.

class ODEFunc(nn.Module):
    def __init__(self, state_dim=2, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 1, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, state_dim),
        )
        nn.init.zeros_(self.net[-1].weight)
        nn.init.zeros_(self.net[-1].bias)

    def forward(self, t, y, I):
        B = y.shape[0]
        I_vec = I.view(B, 1)
        inp = torch.cat([y, I_vec], dim=-1)
        return self.net(inp)

Critical design choice: the time argument $t$ is in the function signature but is not used in the forward computation. The original Neural ODE paper (Chen et al., 2018) includes $t$ in the network input to allow non-autonomous dynamics. We deliberately drop it.

Why drop $t$? FHN's true dynamics are autonomous — the right-hand side $f(v, w, I)$ depends on the current state and the input current, but not on absolute time. Including $t$ in the input would give the model the freedom to learn $f$ as a function of clock time as well as state. With limited training data, the model often does use this freedom: it memorises which $t$ values appeared in training and produces sensible outputs only on those $t$ values.

The numerical consequence is dramatic. With $t$ in the input, extrapolation MSE (predicting trajectories $4 \times$ longer than training) is roughly $5.0$. Without $t$ in the input, extrapolation MSE is $0.02$. A factor of $250$ from one line of code. The bias-toward-the-correct-physics was hiding in the input list.

The keyword-only signature. The forward signature is forward(self, t, y, I) — $t$ is the first argument even though it's unused. This matches the integrator's call convention (which always passes $t$, $y$, $I$ in that order) so we don't have to change the integrator when we toggle between autonomous and non-autonomous variants. Keeping the unused argument is sometimes called "API consistency" and is good practice for code you might want to ablate later.

The Zero-Init Trick

The lines

nn.init.zeros_(self.net[-1].weight)
nn.init.zeros_(self.net[-1].bias)

set the last layer's weights and biases to zero. At initialisation, the network output is exactly zero for any input. The integrator's first step produces $y_1 = y_0 + \Delta t \cdot 0 = y_0$, the second produces $y_2 = y_1 = y_0$, and so on. The model is the identity function over time.

Why is this helpful? Without zero-init, the random initial network produces random vector fields at the start of training. The integrator's first forward pass produces trajectories that diverge from the data wildly, and gradients have to spend the first few thousand iterations un-doing this random damage before any productive learning can start.

With zero-init, the integrator starts producing constant trajectories (the identity). Gradients then push the model to add structured non-zero behaviour — first the basic linear-ish dynamics, then the more complex curve-induced behaviour. Training is much more stable.

The same trick appears in many Neural ODE implementations (including the original paper's code) and in adaLN-Zero (which uses the same principle for Transformer blocks).

The RK4 Integrator

def rk4_integrate(func, y0, I, t_eval):
    states = [y0]
    y = y0
    for i in range(len(t_eval) - 1):
        t = t_eval[i]
        dt = t_eval[i + 1] - t_eval[i]
        k1 = func(t,           y,                 I)
        k2 = func(t + dt / 2,  y + 0.5 * dt * k1, I)
        k3 = func(t + dt / 2,  y + 0.5 * dt * k2, I)
        k4 = func(t + dt,      y + dt * k3,       I)
        y = y + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
        states.append(y)
    return torch.stack(states, dim=0)

Four function evaluations per step, weighted-averaged with the standard RK4 coefficients. Returns a tensor of shape $(T, B, D)$ — time first, then batch, then state.

Why the time grid is variable-step capable. The integrator computes $\Delta t$ from the difference of consecutive $t_\text{eval}$ values rather than from a fixed step size. This means we can evaluate the same trained Neural ODE at arbitrary time grids — coarser, finer, irregular — by just changing the t_eval argument. Discrete RNNs cannot do this; they take fixed-size steps regardless of what time intervals you ask about.

Backpropagation through the integrator. Every $k_i$ involves a call to func, which is the learnable network. Gradients backpropagate through all four evaluations per step, through every step in the trajectory. For $80$ steps, that's $320$ network evaluations in the forward pass and the same number's worth of gradients in the backward pass. This is why Neural ODE training is slower than equivalent discrete-network training — there are more function calls per data point.

The Discrete RNN Baseline

class DiscreteRNN(nn.Module):
    def __init__(self, state_dim=2, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 1, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, state_dim),
        )
        nn.init.zeros_(self.net[-1].weight)
        nn.init.zeros_(self.net[-1].bias)

    def rollout(self, y0, I, n_steps):
        y = y0
        states = [y0]
        for _ in range(n_steps):
            inp = torch.cat([y, I.view(-1, 1)], dim=-1)
            y = y + self.net(inp)
            states.append(y)
        return torch.stack(states, dim=0)

Same architecture as the Neural ODE function. Same parameter count. Same initialisation. The only structural difference is in how it gets used: the RNN takes one residual step per .rollout() iteration, regardless of $\Delta t$.

This is the key control. By matching architecture, parameter count, and initialisation, we isolate exactly one variable: the discrete-vs-continuous-time structure of the update. Any performance difference between the Neural ODE and the discrete RNN is attributable to this difference, not to confounds in capacity or initialisation.

What the RNN's update is implicitly assuming. The RNN's update $y \leftarrow y + g(y, I)$ is computing $g$ as if it were the time-integrated dynamics over one $\Delta t$ — but it has no explicit knowledge of $\Delta t$. The training data was sampled at $\Delta t = 0.5$, so $g$ ends up encoding that specific step size in its weights. If you tried to use the trained RNN at $\Delta t = 0.25$ or $\Delta t = 1.0$, the predictions would be wrong by a factor proportional to the change in $\Delta t$.

The Neural ODE has no such hidden assumption. Its $f$ approximates the true instantaneous derivative $dy/dt$. You can integrate that with any step size you want at inference, including a different step size than training data was sampled at.

Training Loop

opt = torch.optim.AdamW(model.parameters(), lr=3e-3)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
for epoch in range(epochs):
    y0 = states[:, 0, :]                  # initial state of every trajectory
    target = states                        # full trajectory
    pred = model(y0, I, t_eval).transpose(0, 1)   # (B, T, 2)
    loss = ((pred - target) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    opt.step()
    sched.step()

Mean squared error between predicted and ground-truth trajectories. AdamW with $\eta = 3 \times 10^{-3}$, cosine schedule, gradient clipping at $1.0$. Standard training recipe.

Why full-trajectory loss instead of per-step loss. We compute the loss across the entire $80$-step trajectory, not just the next-step prediction. This is essential for the Neural ODE — its training signal includes the full integration error, not just the local truncation. A model that gets each step approximately right but drifts over $80$ steps would be punished by this loss; a per-step loss would not catch that drift.

Gradient clipping for RK4-through-80-steps stability. The RK4 integrator unrolls into $80$ steps $\times$ $4$ function evaluations = $320$ network applications. Without gradient clipping, occasional batches produce very large gradients that destabilise training. Clipping at $1.0$ keeps the optimizer in a useful regime.

Phase-Portrait Visualisation

The phase-portrait diagnostic is the part of this project that requires the most thought. Trajectory MSE is a noisy proxy for "did the model learn the dynamics" — a model can fit individual time series without learning the underlying flow. The phase-portrait visualisation answers the better question directly: query the model's predicted $f$ at a grid of $(v, w)$ points and compare against the ground-truth $f_\text{FHN}$ arrow-by-arrow.

pts = torch.tensor(np.stack([V.ravel(), W.ravel()], axis=-1))
I_t = torch.full((pts.size(0),), I)
with torch.no_grad():
    d = model.func(torch.zeros(()), pts, I_t)
DV_learn = d[:, 0].reshape(V.shape)
DW_learn = d[:, 1].reshape(W.shape)
# compare against fhn_rhs at the same grid points

The model's func attribute is the underlying ODE function (excluding the integrator). Querying it at grid points produces a flow field that we can compare against the ground truth pointwise.

The comparison metric: mean cosine similarity between the learned and ground-truth vector fields. Cosine similarity is appropriate here because magnitudes don't fully match (the learned field's magnitude depends on training dynamics) but directions should. A high cosine similarity means the model has learned the directions of the true flow — that is, the topology of the dynamical system.

Extrapolation Test

The extrapolation test asks: how well does the trained Neural ODE predict trajectories far longer than the training horizon?

truth_400 = simulate(init, I=0.5, n_steps=400, dt=0.5)   # 4x training
t_ext = torch.tensor(0.5 * np.arange(401))
with torch.no_grad():
    ode_out = ode_model(y0, I, t_ext)
    rnn_out = rnn_model(y0, I, t_ext)
# compute MSE on the segment past the training horizon

Training was on $80$-step trajectories. We test on $400$-step trajectories — $4 \times$ longer. The training never showed the model trajectories of this length; it has to extrapolate.

The autonomous-dynamics design choice from earlier gets stress-tested here. If the model overfit to the training time interval (which it would with $t$ in the input), the extrapolation past the training horizon would diverge. If it learned the true vector field, the extrapolation MSE should be roughly the same as the in-distribution MSE.

Summary

Ground-truth FHN simulator: $12$ lines of RHS + RK4 in NumPy.
Neural ODE: $50$ lines including ODEFunc, integrator, wrapper.
Discrete RNN baseline: $20$ lines.
Training + plotting: $\sim 200$ lines of glue.

The implementation is small. The single most consequential design choice — dropping $t$ from the ODE function's input to match FHN's autonomous structure — is one line of code that changes extrapolation MSE by a factor of $250$. The right inductive bias is worth orders of magnitude.

What Part 3 Tests

Part 3 trains the Neural ODE and the discrete RNN under identical conditions, reports their in-distribution and extrapolation MSE, and crucially measures phase-portrait recovery — the angle-wise similarity between each model's learned vector field and the ground-truth FHN flow. The result: both models fit the data equally; only the Neural ODE learned the physics.

Full code on GitHub: github.com/soveshmohapatra/Neural-ODE-Bio