Deconstructing RWKV

Explore linear attention theory, the WKV recurrence operator, a pure PyTorch implementation with both parallel training and O(1) recurrent inference, and head-to-head benchmarks against Transformers and LSTMs.

Part 1

Linear Attention & WKV

How attention becomes a linear recurrence—the WKV operator, time mixing, channel mixing, and why RWKV trains like a Transformer but infers like an RNN.
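The "infers like an RNN" claim comes from the fact that the WKV operator admits a constant-state recurrent form. Below is a minimal single-channel sketch in the RWKV-4 style, using naive exponentials with no log-space stabilization; the decay `w`, current-token bonus `u`, and the function name are illustrative choices, not the reference implementation:

```python
import math

def wkv_recurrent(k, v, w=0.5, u=0.1):
    """Single-channel WKV recurrence (RWKV-4 style sketch).

    w: per-channel decay (positive); u: bonus applied to the current token.
    The state (a, b) holds the decayed weighted sum and its normalizer,
    so each step costs O(1) time and memory regardless of sequence length.
    """
    a, b = 0.0, 0.0
    out = []
    for kt, vt in zip(k, v):
        e = math.exp(u + kt)
        # Output is a weighted average of past values and the current value.
        out.append((a + e * vt) / (b + e))
        decay = math.exp(-w)
        a = decay * (a + math.exp(kt) * vt)  # decayed numerator state
        b = decay * (b + math.exp(kt))       # decayed denominator state
    return out
```

Because each output is a convex combination of values seen so far, the first output equals the first value, and later outputs stay within the range of past values; a production kernel would additionally track a running maximum exponent for numerical stability.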

Part 2

PyTorch Implementation

Building a complete RWKV model in pure PyTorch—parallel training mode, recurrent inference mode, and autoregressive generation. No dependencies beyond PyTorch, and no custom CUDA kernels.
View Code on GitHub

Part 3

3.2× Faster. 5.3× Less Memory.

Benchmarking RWKV against Transformer and LSTM baselines on sequence modeling—measuring training convergence, O(1) inference latency, and memory efficiency.