The quadratic cost of Self-Attention is a hard wall for long-context workloads -- genomic sequences, full-length books, high-frequency biosignals. State Space Models (SSMs), borrowed from classical control theory, sidestep this wall entirely with linear-time scaling and constant-memory inference. This post covers the math: continuous-time state equations, Zero-Order Hold discretization, the recurrence/convolution duality, and the data-dependent selection mechanism that defines Mamba.
1. Introduction: The Memory Wall of Transformers
Self-Attention compares every token to every other token. Let $L$ be the sequence length; both compute and memory scale as $\mathcal{O}(L^2)$. Double the context, quadruple the cost. For sequences in the tens of thousands of tokens, this becomes impractical on commodity hardware.
We need the representational capacity of Transformers with $\mathcal{O}(L)$ scaling. That is where State Space Models come in.
2. Control Theory 101: Continuous-Time State Space
SSMs originate in control theory, where they describe dynamical systems in robotics, aerospace, and fluid dynamics. Rather than cross-referencing all past inputs (as Attention does), an SSM compresses the sequence history into a fixed-size hidden state and updates it as new data arrives.
A continuous-time SSM maps a 1D input signal $x(t) \in \mathbb{R}$ to an output $y(t) \in \mathbb{R}$ through an $N$-dimensional latent state $h(t) \in \mathbb{R}^N$:

$$
h'(t) = \mathbf{A}\, h(t) + \mathbf{B}\, x(t)
$$

$$
y(t) = \mathbf{C}\, h(t) + \mathbf{D}\, x(t)
$$

The matrices:
- $\mathbf{A} \in \mathbb{R}^{N \times N}$ (State Matrix) -- governs how the hidden state evolves and how memory of past inputs decays. The single most important parameter.
- $\mathbf{B} \in \mathbb{R}^{N \times 1}$ (Input Matrix) -- controls how $x(t)$ enters the hidden state.
- $\mathbf{C} \in \mathbb{R}^{1 \times N}$ (Output Matrix) -- projects the hidden state into the output $y(t)$.
- $\mathbf{D} \in \mathbb{R}^{1 \times 1}$ (Feedthrough Matrix) -- a skip connection from input to output, typically handled as a separate linear layer in practice.
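To make the continuous-time system concrete, here is a minimal simulation sketch using forward-Euler integration. The matrix values, timestep, and toy sine input are illustrative assumptions, not from this post:

```python
import numpy as np

# Toy continuous-time SSM, integrated with forward Euler.
N = 4                        # latent state dimension (illustrative)
rng = np.random.default_rng(0)
A = -np.eye(N)               # stable state matrix: memory decays over time
B = rng.standard_normal((N, 1))
C = rng.standard_normal((1, N))
D = np.zeros((1, 1))         # feedthrough disabled for simplicity

dt = 1e-3                    # integration step
h = np.zeros((N, 1))         # hidden state h(t)
ys = []
for t in range(1000):
    x = np.array([[np.sin(2 * np.pi * t * dt)]])  # toy input signal x(t)
    h = h + dt * (A @ h + B @ x)                  # h'(t) = A h(t) + B x(t)
    ys.append((C @ h + D @ x).item())             # y(t) = C h(t) + D x(t)
```

Because $\mathbf{A}$ has negative eigenvalues, the state (and hence the output) stays bounded: the system gradually forgets old inputs.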
3. From Analog to Digital: Discretization
Deep learning operates on discrete tokens $x_k$, not continuous signals. To bridge this gap, we discretize the continuous system using a timestep $\Delta$ and a rule called the Zero-Order Hold (ZOH):

$$
\bar{\mathbf{A}} = \exp(\Delta \mathbf{A}), \qquad \bar{\mathbf{B}} = (\Delta \mathbf{A})^{-1}\left(\exp(\Delta \mathbf{A}) - \mathbf{I}\right)\Delta \mathbf{B}
$$

The continuous ODEs now become a discrete recurrence:

$$
h_k = \bar{\mathbf{A}}\, h_{k-1} + \bar{\mathbf{B}}\, x_k, \qquad y_k = \mathbf{C}\, h_k
$$
Because $\bar{\mathbf{A}}$ and $\bar{\mathbf{B}}$ are constant across the sequence, this is a Linear Time-Invariant (LTI) system -- and that property unlocks a useful duality.
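As a sketch, ZOH is cheap to compute when $\mathbf{A}$ is diagonal (the parameterization used in practice by S4D and Mamba), because the matrix exponential reduces to an elementwise `exp`. The diagonal values below are illustrative:

```python
import numpy as np

def zoh_discretize(A_diag, B, delta):
    """Zero-Order Hold for a *diagonal* state matrix.

    A_diag: (N,) diagonal entries of A; B: (N,) input matrix; delta: scalar.
    Returns (A_bar, B_bar), both shape (N,).
    """
    A_bar = np.exp(delta * A_diag)          # exp(ΔA), elementwise
    B_bar = (A_bar - 1.0) / A_diag * B      # (ΔA)^-1 (exp(ΔA) - I) ΔB
    return A_bar, B_bar

A_diag = np.array([-1.0, -2.0, -4.0])       # stable (negative) eigenvalues
B = np.ones(3)
A_bar, B_bar = zoh_discretize(A_diag, B, delta=0.1)
# As Δ → 0: A_bar → 1 (state fully preserved), B_bar → 0 (input ignored).
```

The limiting behavior in the last comment is exactly the knob Mamba will later turn per token: $\Delta$ interpolates between "keep the state" and "overwrite it with the input."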
4. The Duality: Recurrence vs. Convolution
An LTI SSM can be executed in two distinct modes, and you pick the one that suits your workload.
4.1 Recurrent Representation (Fast Inference)
At inference time, we step through the sequence like an RNN:

$$
h_k = \bar{\mathbf{A}}\, h_{k-1} + \bar{\mathbf{B}}\, x_k, \qquad y_k = \mathbf{C}\, h_k
$$
The state $h_k$ is a compressed summary of all prior context. Memory is $\mathcal{O}(1)$; generation cost is $\mathcal{O}(L)$.
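A minimal sketch of this recurrent mode; shapes follow the definitions above, and the helper name `ssm_recurrent` is ours:

```python
import numpy as np

def ssm_recurrent(A_bar, B_bar, C, xs):
    """Step the discretized SSM token by token, like an RNN.

    A_bar: (N,N), B_bar: (N,1), C: (1,N); xs: iterable of scalar tokens.
    """
    h = np.zeros((A_bar.shape[0], 1))   # O(1) memory: only the current state
    ys = []
    for x in xs:                        # O(L) time: one update per token
        h = A_bar @ h + B_bar * x       # h_k = A_bar h_{k-1} + B_bar x_k
        ys.append((C @ h).item())       # y_k = C h_k
    return np.array(ys)
```

No matter how long the context, generating the next output only ever touches the fixed-size state $h$, which is why SSM inference runs in constant memory.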
4.2 Convolutional Representation (Fast Training)
During training, the full sequence $x = (x_1, x_2, \dots, x_L)$ is available at once. Because the system is linear, we can unroll the recurrence. Starting from $h_0 = 0$:

$$
h_1 = \bar{\mathbf{B}} x_1, \qquad h_2 = \bar{\mathbf{A}}\bar{\mathbf{B}} x_1 + \bar{\mathbf{B}} x_2, \qquad \dots, \qquad h_k = \sum_{j=1}^{k} \bar{\mathbf{A}}^{\,k-j} \bar{\mathbf{B}}\, x_j
$$

Substituting into $y_k = \mathbf{C} h_k$ yields a single 1D convolution over the entire sequence:

$$
y_k = \sum_{j=1}^{k} \mathbf{C} \bar{\mathbf{A}}^{\,k-j} \bar{\mathbf{B}}\, x_j \quad\Longleftrightarrow\quad y = x * \bar{\mathbf{K}}, \qquad \bar{\mathbf{K}} = \left(\mathbf{C}\bar{\mathbf{B}},\ \mathbf{C}\bar{\mathbf{A}}\bar{\mathbf{B}},\ \dots,\ \mathbf{C}\bar{\mathbf{A}}^{\,L-1}\bar{\mathbf{B}}\right)
$$
This lets us use optimized convolution primitives and FFT for parallel training, avoiding the sequential bottleneck of RNNs.
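A sketch of the convolutional mode, assuming the discretized matrices are given: materialize the kernel explicitly, then apply a causal convolution. `np.convolve` here stands in for the FFT-based or fused primitives a real implementation would use:

```python
import numpy as np

def ssm_conv(A_bar, B_bar, C, xs):
    """Compute all outputs at once via the SSM convolution kernel.

    Kernel: K = (C B_bar, C A_bar B_bar, ..., C A_bar^{L-1} B_bar).
    """
    L = len(xs)
    K = np.zeros(L)
    AkB = B_bar.copy()                  # running A_bar^k B_bar, starting at k=0
    for k in range(L):
        K[k] = (C @ AkB).item()         # K_k = C A_bar^k B_bar
        AkB = A_bar @ AkB
    # Causal 1D convolution: y_k = sum_{j<=k} K_{k-j} x_j
    return np.convolve(xs, K)[:L]
```

Both modes compute the same function; the convolutional form just trades the sequential dependency for one big parallel operation, which is what makes training fast.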
5. Mamba: Selective State Spaces
Standard SSMs are LTI: the same state transition applies to every token regardless of content. The model cannot selectively ignore filler tokens or emphasize important ones. Gu and Dao (2023) fixed this in Mamba by making $\mathbf{B}$, $\mathbf{C}$, and $\Delta$ functions of the input:

$$
\mathbf{B}_k = \mathrm{Linear}_N(x_k), \qquad \mathbf{C}_k = \mathrm{Linear}_N(x_k), \qquad \Delta_k = \mathrm{softplus}\big(\mathrm{Linear}_1(x_k) + b_\Delta\big)
$$

so the discretized recurrence now varies per token: $h_k = \bar{\mathbf{A}}_k h_{k-1} + \bar{\mathbf{B}}_k x_k$, $y_k = \mathbf{C}_k h_k$.
The system is now data-dependent. $\Delta_k$ acts as a gate: a large $\Delta_k$ focuses on the current input (resetting memory), while a small $\Delta_k$ preserves existing state (ignoring the current token). This bridges the constant-memory efficiency of SSMs with the content-aware routing of Attention.
The tradeoff: making parameters input-dependent breaks the LTI assumption, so the convolution shortcut no longer applies. Mamba compensates with a hardware-aware parallel scan algorithm that keeps training speed competitive with Transformers while preserving linear scaling.
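To illustrate the selective recurrence (not Mamba's actual implementation, which fuses everything into a hardware-aware parallel scan), here is a naive sequential sketch. The projection vectors `W_B`, `W_C`, `W_delta` are hypothetical stand-ins for Mamba's learned linear projections, and `A_diag` is a diagonal state matrix:

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))          # smooth map to positive Δ

def selective_scan(A_diag, W_B, W_C, W_delta, xs):
    """Naive selective SSM: B, C, Δ are recomputed from each token.

    A_diag: (N,) diagonal of A; W_B, W_C: (N,) projections; W_delta: scalar.
    """
    h = np.zeros(A_diag.shape[0])
    ys = []
    for x in xs:
        delta = softplus(W_delta * x)    # per-token timestep Δ_k > 0
        B_k = W_B * x                    # input-dependent input matrix
        C_k = W_C * x                    # input-dependent output matrix
        A_bar = np.exp(delta * A_diag)   # per-token ZOH discretization
        B_bar = (A_bar - 1.0) / A_diag * B_k
        h = A_bar * h + B_bar * x        # elementwise: A is diagonal
        ys.append(float(C_k @ h))
    return np.array(ys)
```

Because `A_bar` and `B_bar` change at every step, no fixed kernel $\bar{K}$ exists, which is precisely why Mamba needs a parallel scan instead of a convolution.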
6. Conclusion
SSMs connect continuous dynamical systems to sequence modeling. The shift from $\mathcal{O}(L^2)$ attention to $\mathcal{O}(L)$ selective state spaces makes it practical to process sequences of millions of tokens.
In Part 2, we translate these equations into a working 1D State Space layer in under 100 lines of PyTorch.