Deconstructing Echo State Networks

Part 2: Pure PyTorch Implementation

Introduction

In Part 1, we established that Backpropagation Through Time (BPTT) is slow and mathematically fragile due to vanishing gradients. We proposed Echo State Networks (ESNs) as the solution: initializing a massive, sparse, randomly connected RNN—the "Reservoir"—freezing it permanently, and only training the final readout layer using closed-form Ridge Regression.

Today, we take that math and translate it into a pure PyTorch implementation.

Initializing the Sparse Reservoir

The defining characteristic of an ESN is its reservoir matrix $W_{res}$. It needs to be large (often hundreds or thousands of neurons) and highly sparse.

# 1. Initialize a sparse random matrix with entries in [-0.5, 0.5]
#    (here `sparsity` is the fraction of connections kept non-zero)
W = torch.rand(hidden_size, hidden_size) - 0.5
mask = torch.rand(hidden_size, hidden_size) < sparsity
W = W * mask

Left unscaled, however, this matrix will cause the reservoir states to either explode or vanish over time. To guarantee the Echo State Property, we must rescale it so that its Spectral Radius (the maximum absolute eigenvalue) equals a target value strictly less than 1.

# 2. Scale W_res to achieve the desired Spectral Radius
eigenvalues = torch.linalg.eigvals(W)
current_sr = torch.max(torch.abs(eigenvalues)).item()

if current_sr > 0:
    W = W * (spectral_radius / current_sr)
    
self.W_res = nn.Parameter(W, requires_grad=False)

By setting requires_grad=False, we explicitly tell PyTorch never to attempt gradient descent on this matrix.
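Putting the two initialization steps together, here is a minimal self-contained sketch (the function name `init_reservoir` and the fixed seed are my own; `sparsity` follows the fraction-kept convention used above):

```python
import torch

def init_reservoir(hidden_size, sparsity=0.1, spectral_radius=0.9, seed=0):
    """Build a sparse random reservoir scaled to a target spectral radius."""
    torch.manual_seed(seed)
    # Random weights in [-0.5, 0.5], then zero out most connections
    W = torch.rand(hidden_size, hidden_size) - 0.5
    mask = torch.rand(hidden_size, hidden_size) < sparsity
    W = W * mask
    # Rescale so the largest |eigenvalue| equals spectral_radius
    eigenvalues = torch.linalg.eigvals(W)
    current_sr = torch.max(torch.abs(eigenvalues)).item()
    if current_sr > 0:
        W = W * (spectral_radius / current_sr)
    return W

W = init_reservoir(500, sparsity=0.1, spectral_radius=0.9)
sr = torch.max(torch.abs(torch.linalg.eigvals(W))).item()
```

Because eigenvalues scale linearly with the matrix, the rescaled `W` hits the target spectral radius exactly (up to floating point).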

The State Harvesting Loop

Because we aren't using BPTT, our forward pass doesn't calculate gradients. Its only job is to run the input sequence through the frozen reservoir and collect (harvest) the internal non-linear states $X$ over time.

# Leaky integrator ESN state update
update = torch.tanh(u_t @ self.W_in.T + state @ self.W_res.T)
state = (1 - leaky_rate) * state + leaky_rate * update
states.append(state)
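For context, the update rule above lives inside a time loop over the input sequence. A minimal self-contained sketch of the full harvest (the weight tensors and shapes below are illustrative stand-ins for the model's `self.W_in` and `self.W_res`):

```python
import torch

torch.manual_seed(0)
T, input_size, hidden_size = 100, 1, 200
leaky_rate = 0.3

# Stand-in frozen weights (in the real model these are self.W_in / self.W_res)
W_in = torch.rand(hidden_size, input_size) - 0.5
W_res = (torch.rand(hidden_size, hidden_size) - 0.5) * 0.1
U = torch.randn(T, input_size)

state = torch.zeros(hidden_size)
states = []
with torch.no_grad():  # the reservoir is frozen, so no graph is needed
    for t in range(T):
        u_t = U[t]
        update = torch.tanh(u_t @ W_in.T + state @ W_res.T)
        state = (1 - leaky_rate) * state + leaky_rate * update
        states.append(state)

X = torch.stack(states)  # harvested state matrix, shape (T, hidden_size)
```

Wrapping the loop in `torch.no_grad()` is optional but keeps memory flat, since we never need a computation graph for the frozen reservoir.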

Closed-Form Ridge Regression Readout

The most elegant part of an ESN is the fit() method. We collect all the internal states $X$ generated by the input sequence, along with the corresponding targets $Y_{target}$. To find the output weights $W_{out}$ that map $X \to Y$, we simply solve the normal equations for Ridge Regression (Tikhonov Regularization):

$$ W_{out} = Y^T X (X^T X + \lambda \mathbf{I})^{-1} $$

In PyTorch, we implement this natively using torch.linalg.solve, which is highly optimized for solving linear systems:

# Build the extended state matrix: bias column, raw inputs U, harvested states X
S = torch.cat([bias, U, X], dim=1)
identity = torch.eye(S.shape[1], device=S.device)

# S^T * S and S^T * Y
STS = S.T @ S
STY = S.T @ Y

# Solve (S^T S + lambda I) W_out^T = S^T Y
W_out_T = torch.linalg.solve(STS + ridge_lambda * identity, STY)

# Assign parameters instantly
self.W_out.data = W_out_T.T
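To see the whole readout in one place, here is a minimal end-to-end sketch on synthetic data (the toy "harvested" state matrix and sine target below are illustrative stand-ins, not a real reservoir run):

```python
import torch

torch.manual_seed(0)
T, in_dim, hid, out_dim = 200, 1, 50, 1
ridge_lambda = 1e-6

# Toy data: pretend X was harvested from a reservoir driven by U
U = torch.randn(T, in_dim)
X = torch.tanh(U @ torch.randn(in_dim, hid))
Y = torch.sin(U)  # target signal

# Extended state: [bias | inputs | reservoir states]
bias = torch.ones(T, 1)
S = torch.cat([bias, U, X], dim=1)
identity = torch.eye(S.shape[1])

# Closed-form ridge solution: (S^T S + lambda I) W_out^T = S^T Y
W_out_T = torch.linalg.solve(S.T @ S + ridge_lambda * identity, S.T @ Y)

# Inference is a single matrix multiply
Y_pred = S @ W_out_T
mse = torch.mean((Y_pred - Y) ** 2).item()
```

Note there is no optimizer and no training loop anywhere: the single `torch.linalg.solve` call is the entire "training" step.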

Next Steps: The Chaotic Benchmark

We now have a complete mathematical theory and a pure PyTorch 2.0 implementation of a Reservoir Computer. The entire training process finishes in a single deterministic linear solve: no epochs, no learning rates, no vanishing gradients.

In Part 3, we will put this to the ultimate test: benchmarking our ESN against a standard PyTorch LSTM on predicting the highly chaotic Mackey-Glass time series.