
Deconstructing Liquid Neural Networks (LNNs)

Part 2: Writing a Liquid Layer in Pure PyTorch

Introduction

Part 1 covered the continuous-time math behind LTC networks--how they replace discrete state transitions with ODEs governed by dynamic time constants.

Now we force that continuous math onto discrete hardware. Below is a minimal, working Liquid layer in PyTorch.

Translating ODEs to Discrete Steps

The LTC ODE cannot run directly on digital hardware. We approximate it with a forward Euler step:

$$ x(t + \Delta t) = x(t) + \Delta t \cdot \left[ - \frac{x(t)}{\tau_{sys}} + A \cdot f(x(t), I(t)) \right] $$

$f$ is the output of a nonlinear layer, $A$ is a steady-state parameter, and $\tau_{sys}$ is the dynamic, data-dependent time constant from Part 1.

The PyTorch Implementation

Unlike an LSTM's four gate matrices, this layer only needs to learn $f$ (one linear layer + sigmoid) and the base time constant $\tau$.

import torch
import torch.nn as nn

class LiquidLayer(nn.Module):
    def __init__(self, input_size: int, state_size: int, dt: float = 0.1):
        super().__init__()
        self.dt = dt  # Euler step size
        # Single weight matrix: computes f from the concatenated [input, state]
        self.fc = nn.Linear(input_size + state_size, state_size)
        # Per-neuron steady-state amplitude A
        self.A = nn.Parameter(torch.randn(state_size))
        # Per-neuron base time constant (kept positive via softplus in forward)
        self.tau = nn.Parameter(torch.ones(state_size))

    def forward(self, x: torch.Tensor, state: torch.Tensor):
        # x: (batch, input_size), state: (batch, state_size)
        combined = torch.cat([x, state], dim=1)
        f = torch.sigmoid(self.fc(combined))

        # Calculate dynamic time constant
        tau_val = nn.functional.softplus(self.tau)
        tau_sys = tau_val / (1.0 + tau_val * f)

        # Evaluate continuous derivative and take a discrete Euler step
        dx_dt = - (state / tau_sys) + self.A * f
        new_state = state + self.dt * dx_dt

        return new_state

Analyzing the Code

The only weight matrix is self.fc, which computes $f$. Everything else--$\tau$, $A$--is a per-neuron parameter vector. The network does not memorize sequences in a static matrix. Instead, it learns to modulate $\tau_{sys}$ through $f$: fast-changing inputs shrink the time constant for quick adaptation; stable inputs raise it to hold memory longer.
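
As a quick illustration (values chosen for clarity, with the base time constant at its initial value of 1.0), the dynamic time constant responds to $f$ like this:

tau_val = 1.0  # base time constant at its initial value
for f in (0.0, 0.5, 1.0):
    print(f, tau_val / (1.0 + tau_val * f))
# f = 0.0 -> tau_sys = 1.0     (slow neuron, long memory)
# f = 0.5 -> tau_sys ≈ 0.667
# f = 1.0 -> tau_sys = 0.5     (fast neuron, quick adaptation)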

Parameter Count Breakdown

Consider a concrete example: input_size=1 and state_size=8. The linear layer self.fc maps from input_size + state_size = 9 to state_size = 8, giving us $9 \times 8 + 8 = 80$ parameters (weights + bias). Then self.A adds 8 parameters and self.tau adds 8 more. That totals 96 learnable parameters for the liquid layer itself. Add a final output head (nn.Linear(8, 1) = 9 parameters) and you reach 105 parameters for a complete sequence-to-prediction model.

Compare that to a standard nn.LSTM with the same hidden size of 8. An LSTM has four internal gates (input, forget, cell, output), each with its own weight matrix for the input and hidden state plus biases. With a single bias per gate that works out to $4 \times (1 \times 8 + 8 \times 8 + 8) = 4 \times 80 = 320$ parameters for the recurrent cell alone; PyTorch's nn.LSTM stores two bias vectors per gate, which brings the cell to 352. Add 9 for the output head and you reach 361 total. The LSTM needs roughly 3.5x more parameters than the Liquid layer to process the same input.
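
A quick way to check these counts is to instantiate both cells and sum their parameter tensors. This is a small sanity-check sketch, assuming the LiquidLayer defined above and a single-layer nn.LSTM:

layer = LiquidLayer(input_size=1, state_size=8)
lstm = nn.LSTM(input_size=1, hidden_size=8, batch_first=True)

liquid_params = sum(p.numel() for p in layer.parameters())  # 80 (fc) + 8 (A) + 8 (tau)
lstm_params = sum(p.numel() for p in lstm.parameters())      # includes both bias vectors

print(liquid_params, lstm_params)  # 96 352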

Why Softplus on Tau?

The base time constant self.tau is initialized to ones, but the raw parameter can drift negative during training. We apply softplus to guarantee a strictly positive value:

$$ \text{softplus}(\tau) = \ln(1 + e^{\tau}) $$

This is smoother than ReLU near zero and avoids the dead-neuron problem. The resulting tau_val is always positive, which is physically necessary--a negative time constant would cause the ODE to diverge exponentially instead of settling toward an attractor.
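
As a quick numerical illustration (a standalone sketch, separate from the model code), softplus maps any raw value to a strictly positive output while staying close to the identity for large inputs:

import torch
import torch.nn.functional as F

raw = torch.tensor([-2.0, 0.0, 1.0, 5.0])
print(F.softplus(raw))
# tensor([0.1269, 0.6931, 1.3133, 5.0067]), every value is positive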

The Euler Step in Detail

The forward pass executes a single forward Euler step. Walking through it line by line:

  1. Concatenation: the current input $x$ and the previous hidden state are concatenated into one vector. This gives the nonlinear mapping $f$ access to both the external signal and the network's own memory.
  2. Sigmoid activation: $f = \sigma(W \cdot [x, \text{state}] + b)$. The sigmoid bounds $f$ to $(0, 1)$, which keeps $\tau_{sys}$ well-behaved. If $f$ were unbounded, the dynamic time constant could collapse to zero and blow up the $-\text{state}/\tau_{sys}$ term.
  3. Dynamic time constant: $\tau_{sys} = \tau_{\text{val}} / (1 + \tau_{\text{val}} \cdot f)$. When $f \to 0$ (stable input), $\tau_{sys} \approx \tau_{\text{val}}$ and the neuron changes slowly. When $f \to 1$ (volatile input), $\tau_{sys}$ shrinks, making the neuron react faster.
  4. Derivative + step: the ODE derivative $dx/dt = -\text{state}/\tau_{sys} + A \cdot f$ combines a leak term (pulling state toward zero) and a drive term (pushing state toward the attractor $A$). Multiplying by self.dt and adding to the current state completes the Euler update. A short derivation after this list shows exactly where those two terms balance.
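
To see how the leak and drive terms balance, set the derivative to zero for a fixed $f$ and solve for the state the neuron settles toward:

$$ 0 = -\frac{x^*}{\tau_{sys}} + A \cdot f \quad \Rightarrow \quad x^* = \tau_{sys} \cdot A \cdot f = \frac{\tau_{\text{val}} \cdot A \cdot f}{1 + \tau_{\text{val}} \cdot f} $$

For small $f$ the fixed point sits near zero; as $f$ grows, it approaches $A$, which is why $A$ acts as a per-neuron steady-state amplitude.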

Choosing the Step Size dt

The default dt = 0.1 is a balance between accuracy and compute cost. Smaller values (e.g., 0.01) give a tighter approximation of the true ODE solution but require more sub-steps per sequence element. Larger values risk numerical instability--the Euler method is only first-order accurate, and if dt exceeds the fastest time scale in the system, the integration oscillates or diverges. For our chaotic benchmark task, dt = 0.1 was stable and converged without needing higher-order solvers like Runge-Kutta 4.
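
If a smaller effective step is needed without changing how often the sequence is sampled, one option is to split each dt into several Euler sub-steps. The following is a hypothetical variant, not the layer used in the benchmark; SubstepLiquidLayer and n_substeps are illustrative names:

class SubstepLiquidLayer(LiquidLayer):
    # Hypothetical variant: split each dt into several smaller Euler sub-steps
    # for a tighter ODE approximation at the cost of extra compute.
    def forward(self, x: torch.Tensor, state: torch.Tensor, n_substeps: int = 4):
        sub_dt = self.dt / n_substeps
        for _ in range(n_substeps):
            combined = torch.cat([x, state], dim=1)
            f = torch.sigmoid(self.fc(combined))
            tau_val = nn.functional.softplus(self.tau)
            tau_sys = tau_val / (1.0 + tau_val * f)
            dx_dt = -(state / tau_sys) + self.A * f
            state = state + sub_dt * dx_dt
        return state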

Wrapping the Layer into a Full Model

The LiquidLayer processes one time step at a time. To handle a full sequence, we loop over time steps, feeding the output state of each step as the input state to the next:

class LiquidNet(nn.Module):
    def __init__(self, input_size, state_size, output_size):
        super().__init__()
        self.liquid = LiquidLayer(input_size, state_size)
        self.head = nn.Linear(state_size, output_size)
        self.state_size = state_size

    def forward(self, x_seq):
        # x_seq: (batch, seq_len, input_size)
        batch = x_seq.size(0)
        state = torch.zeros(batch, self.state_size, device=x_seq.device)
        # Unroll the ODE over the sequence, carrying the hidden state between steps
        for t in range(x_seq.size(1)):
            state = self.liquid(x_seq[:, t, :], state)
        # Predict from the final accumulated state
        return self.head(state)

The hidden state starts at zero and accumulates the trajectory's dynamics step by step through the ODE. At the final time step, the accumulated state passes through a single linear head to produce the prediction. This is the full model architecture used in the Part 3 benchmark--105 parameters total, ready to go head-to-head with an LSTM.
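
As a final sanity check (a usage sketch built on the classes above; the batch size and sequence length are arbitrary), the model can be instantiated and run on a dummy batch to confirm the 105-parameter count and the output shape:

model = LiquidNet(input_size=1, state_size=8, output_size=1)
print(sum(p.numel() for p in model.parameters()))  # 105

dummy = torch.randn(32, 50, 1)  # (batch, sequence length, features)
print(model(dummy).shape)       # torch.Size([32, 1])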