Part 1 covered the continuous-time math behind LTC networks--how they replace discrete state transitions with ODEs governed by dynamic time constants.
Now we force that continuous math onto discrete hardware. Below is a minimal, working Liquid layer in PyTorch.
Translating ODEs to Discrete Steps
The LTC ODE cannot run directly on digital hardware. We approximate it with a forward Euler step:

$$x_{t+1} = x_t + \Delta t \left( -\frac{x_t}{\tau_{sys}} + A \cdot f \right)$$

Here $f$ is the output of a nonlinear layer, $A$ is a steady-state parameter, and $\tau_{sys}$ is the dynamic, data-dependent time constant from Part 1.
The PyTorch Implementation
Unlike an LSTM's four gate matrices, this layer only needs to learn $f$ (one linear layer + sigmoid) and the base time-constant $\tau$.
```python
import torch
import torch.nn as nn

class LiquidLayer(nn.Module):
    def __init__(self, input_size: int, state_size: int, dt: float = 0.1):
        super().__init__()
        self.dt = dt
        self.fc = nn.Linear(input_size + state_size, state_size)
        self.A = nn.Parameter(torch.randn(state_size))
        self.tau = nn.Parameter(torch.ones(state_size))

    def forward(self, x: torch.Tensor, state: torch.Tensor):
        combined = torch.cat([x, state], dim=1)
        f = torch.sigmoid(self.fc(combined))
        # Calculate dynamic time constant
        tau_val = nn.functional.softplus(self.tau)
        tau_sys = tau_val / (1.0 + tau_val * f)
        # Evaluate continuous derivative and take a discrete Euler step
        dx_dt = -(state / tau_sys) + self.A * f
        new_state = state + self.dt * dx_dt
        return new_state
```
Analyzing the Code
The only weight matrix is self.fc, which computes $f$. Everything else--$\tau$, $A$--is
a per-neuron parameter vector. The network does not memorize sequences in a static matrix. Instead,
it learns to modulate $\tau_{sys}$ through $f$: fast-changing inputs shrink the time constant for
quick adaptation; stable inputs raise it to hold memory longer.
Parameter Count Breakdown
Consider a concrete example: input_size=1 and state_size=8. The linear
layer self.fc maps from input_size + state_size = 9 to state_size = 8,
giving us $9 \times 8 + 8 = 80$ parameters (weights + bias). Then self.A adds 8 parameters and
self.tau adds 8 more. That totals 96 learnable parameters for the liquid
layer itself. Add a final output head (nn.Linear(8, 1) = 9 parameters) and you reach
105 parameters for a complete sequence-to-prediction model.
Compare that to a standard nn.LSTM with the same hidden size of 8. An LSTM has four
internal gates (input, forget, cell, output), each with its own weight matrices for the input and
hidden state plus biases (PyTorch stores two bias vectors per gate, b_ih and b_hh). That works out
to $4 \times (1 \times 8 + 8 \times 8 + 8 + 8) = 4 \times 88 = 352$
parameters for the recurrent cell alone, plus 9 for the output head: 361 total.
The LSTM needs roughly 3.5x more parameters than the Liquid layer to process the same input.
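These counts are easy to verify directly in PyTorch. A standalone sanity check (note that nn.LSTM stores two bias vectors per gate, b_ih and b_hh, which is why the gate count uses two bias terms):

```python
import torch.nn as nn

# Liquid layer: one Linear(9 -> 8) plus two per-neuron vectors (A and tau)
fc = nn.Linear(9, 8)
liquid_params = sum(p.numel() for p in fc.parameters()) + 8 + 8
print(liquid_params)  # 96

# LSTM with the same hidden size of 8 and input size of 1
lstm = nn.LSTM(input_size=1, hidden_size=8)
lstm_params = sum(p.numel() for p in lstm.parameters())
print(lstm_params)  # 352
```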
Why Softplus on Tau?
The base time constant self.tau is initialized to ones, but the raw parameter can drift
negative during training. We apply softplus to guarantee a strictly positive value:

$$\tau_{\text{val}} = \text{softplus}(\tau) = \log\left(1 + e^{\tau}\right)$$
This is smoother than ReLU near zero and avoids the dead-neuron problem. The resulting
tau_val is always positive, which is physically necessary--a negative time constant
would cause the ODE to diverge exponentially instead of settling toward an attractor.
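A two-line comparison makes the difference concrete (illustrative input values, not part of the layer):

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, 0.0, 2.0])
sp = F.softplus(x)  # strictly positive everywhere, even for negative inputs
relu = F.relu(x)    # exactly zero for x <= 0, where gradients die
print(sp)
print(relu)
```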
The Euler Step in Detail
The forward pass executes a single forward Euler step. Walking through it line by line:
- Concatenation: the current input $x$ and the previous hidden state are concatenated into one vector. This gives the nonlinear mapping $f$ access to both the external signal and the network's own memory.
- Sigmoid activation: $f = \sigma(W \cdot [x, \text{state}] + b)$. The sigmoid bounds $f$ to $(0, 1)$, which keeps $\tau_{sys}$ well-behaved. If $f$ were unbounded, the dynamic time constant could collapse to zero and blow up the $-\text{state}/\tau_{sys}$ term.
- Dynamic time constant: $\tau_{sys} = \tau_{\text{val}} / (1 + \tau_{\text{val}} \cdot f)$. When $f \to 0$ (stable input), $\tau_{sys} \approx \tau_{\text{val}}$ and the neuron changes slowly. When $f \to 1$ (volatile input), $\tau_{sys}$ shrinks, making the neuron react faster.
- Derivative + step: the ODE derivative $dx/dt = -\text{state}/\tau_{sys} + A \cdot f$ combines a leak term (pulling state toward zero) and a drive term (pushing state toward the attractor $A$). Multiplying by self.dt and adding to the current state completes the Euler update.
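The modulation in the third bullet is easy to check numerically. With the base parameter at its initialization of one, so $\tau_{\text{val}} = \text{softplus}(1) \approx 1.31$, sweeping $f$ from 0 to 1 shows the time constant shrinking:

```python
import torch
import torch.nn.functional as F

tau_val = F.softplus(torch.ones(1))          # base time constant, ~1.31
time_constants = []
for f in (0.0, 0.5, 1.0):
    tau_sys = tau_val / (1.0 + tau_val * f)  # dynamic time constant
    time_constants.append(tau_sys.item())
print(time_constants)  # monotonically decreasing as f grows
```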
Choosing the Step Size dt
The default dt = 0.1 is a balance between accuracy and compute cost. Smaller values
(e.g., 0.01) give a tighter approximation of the true ODE solution but require more sub-steps per
sequence element. Larger values risk numerical instability--the Euler method is only first-order
accurate, and if dt exceeds the fastest time scale in the system, the integration
oscillates or diverges. For our chaotic benchmark task, dt = 0.1 was stable and
converged without needing higher-order solvers like Runge-Kutta 4.
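The stability boundary is visible on the leak term alone. For the linear ODE $dx/dt = -x/\tau$, forward Euler gives $x_{t+1} = (1 - \Delta t/\tau)\,x_t$, which decays only while $\Delta t < 2\tau$. A minimal sketch with hypothetical values (a fast neuron with $\tau = 0.05$):

```python
def euler_leak(dt: float, tau: float = 0.05, steps: int = 50) -> float:
    """Integrate dx/dt = -x/tau with forward Euler; stable only if dt < 2*tau."""
    x = 1.0
    for _ in range(steps):
        x = x + dt * (-x / tau)
    return x

print(abs(euler_leak(dt=0.01)))  # well below 1: the state decays
print(abs(euler_leak(dt=0.2)))   # dt > 2*tau: the state explodes
```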
Wrapping the Layer into a Full Model
The LiquidLayer processes one time step at a time. To handle a full sequence, we loop
over time steps, feeding the output state of each step as the input state to the next:
```python
class LiquidNet(nn.Module):
    def __init__(self, input_size, state_size, output_size):
        super().__init__()
        self.liquid = LiquidLayer(input_size, state_size)
        self.head = nn.Linear(state_size, output_size)
        self.state_size = state_size

    def forward(self, x_seq):
        batch = x_seq.size(0)
        state = torch.zeros(batch, self.state_size, device=x_seq.device)
        # Unroll over time: each step's output state feeds the next step
        for t in range(x_seq.size(1)):
            state = self.liquid(x_seq[:, t, :], state)
        return self.head(state)
```
The hidden state starts at zero and accumulates the trajectory's dynamics step by step through the ODE. At the final time step, the accumulated state passes through a single linear head to produce the prediction. This is the full model architecture used in the Part 3 benchmark--105 parameters total, ready to go head-to-head with an LSTM.
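As a quick smoke test (both classes repeated verbatim so the snippet runs on its own), the 105-parameter configuration produces one prediction per sequence:

```python
import torch
import torch.nn as nn

class LiquidLayer(nn.Module):
    def __init__(self, input_size, state_size, dt=0.1):
        super().__init__()
        self.dt = dt
        self.fc = nn.Linear(input_size + state_size, state_size)
        self.A = nn.Parameter(torch.randn(state_size))
        self.tau = nn.Parameter(torch.ones(state_size))

    def forward(self, x, state):
        f = torch.sigmoid(self.fc(torch.cat([x, state], dim=1)))
        tau_val = nn.functional.softplus(self.tau)
        tau_sys = tau_val / (1.0 + tau_val * f)
        return state + self.dt * (-(state / tau_sys) + self.A * f)

class LiquidNet(nn.Module):
    def __init__(self, input_size, state_size, output_size):
        super().__init__()
        self.liquid = LiquidLayer(input_size, state_size)
        self.head = nn.Linear(state_size, output_size)
        self.state_size = state_size

    def forward(self, x_seq):
        state = torch.zeros(x_seq.size(0), self.state_size, device=x_seq.device)
        for t in range(x_seq.size(1)):
            state = self.liquid(x_seq[:, t, :], state)
        return self.head(state)

model = LiquidNet(input_size=1, state_size=8, output_size=1)
pred = model(torch.randn(4, 100, 1))               # (batch, time, features)
print(pred.shape)                                  # torch.Size([4, 1])
print(sum(p.numel() for p in model.parameters()))  # 105
```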