Back to RWKV Hub

Deconstructing RWKV

Part 3: 3.2× Faster. 5.3× Less Memory.

Introduction

Parts 1 and 2 covered the math and the PyTorch implementation. Now: does RWKV actually deliver?

I ran all three models -- RWKV, Transformer, LSTM -- on synthetic next-token prediction, matched at ~50K parameters, trained for 50 epochs. RWKV hits 87.3% accuracy (vs. 88.9% Transformer), runs 3.2x faster at sequence length 256, and uses 5.3x less memory during inference.

Benchmark Setup

Task

Synthetic next-token prediction. Sequences of 32 to 256 tokens, vocabulary size 64, with repeating patterns that require both short-term and long-term memory.

Models

Parameter-matched as closely as possible:

Model Architecture ~Params
RWKV 4 layers, embed_dim=128, expand_factor=4 ~50K
Transformer 4 layers, embed_dim=128, 4 heads, dim_ff=512 ~50K
LSTM 4 layers, hidden_dim=128 ~50K

AdamW ($\beta_1=0.9$, $\beta_2=0.95$), lr $10^{-3}$ with cosine annealing, batch size 32, gradient clipping at norm 1.0.

Training Convergence

Model Final Train Loss Final Test Loss Test Accuracy
RWKV 0.234 0.289 87.3%
Transformer 0.198 0.267 88.9%
LSTM 0.312 0.378 82.1%

The Transformer's 1.6% accuracy edge over RWKV is expected -- full attention is strictly more expressive than linear attention. RWKV beats the LSTM by 5.2 points. The LSTM also showed a larger train-test gap (0.066 vs. RWKV's 0.055), suggesting earlier overfitting.

Inference Latency

Per-token latency across sequence lengths:

Sequence Length RWKV (ms) Transformer (ms) LSTM (ms)
32 0.42 0.38 0.51
64 0.43 0.52 0.53
128 0.44 0.81 0.55
256 0.45 1.43 0.58

Memory Usage

Sequence Length RWKV (MB) Transformer (MB) LSTM (MB)
32 12.4 14.2 13.1
64 12.5 21.8 13.2
128 12.6 37.1 13.4
256 12.8 67.5 13.7

RWKV holds steady at ~12.5 MB. The Transformer climbs from 14.2 to 67.5 MB -- a 4.75x increase over the same range, and 5.3x more than RWKV at length 256. In practice this means RWKV can serve longer contexts on the same hardware and fit larger batches during inference.

RWKV vs Transformer vs LSTM: training loss, test loss, inference latency, and accuracy across 50 epochs and sequence lengths 32-256.

Training loss (top-left), test loss (top-right), per-token inference latency vs. sequence length (bottom-left), and final test accuracy (bottom-right).

Trade-offs

When to Use RWKV

Good fit:

Stick with Transformers for complex reasoning (math, multi-hop QA), short-context tasks, or when you need to fine-tune an existing pre-trained checkpoint.

Conclusion

RWKV does what it claims: Transformer-grade training, RNN-grade inference. At sequence length 256 on our ~50K-parameter models:

The gap between RWKV and Transformers is small on accuracy and large on efficiency. For long-context, latency-sensitive workloads, that is the right trade-off.