// the find
RWKV/rwkv.cpp
INT4/INT5/INT8 and FP16 inference on CPU for RWKV language model
rwkv.cpp is a ggml-based CPU inference engine for RWKV language models, supporting INT4/INT5/INT8 quantization and FP16. RWKV is an RNN-style architecture whose per-token cost at generation time is O(1) in context length, which makes it more CPU-friendly than transformer attention on long contexts. The target audience is people who want to run LLMs locally without a GPU.
RWKV's recurrent architecture means memory usage stays flat regardless of context length: there is no KV cache growing with every token. The quantization benchmarks are honest and include perplexity numbers, not just speed claims. It supports the v4 through v7 model architectures, so it tracks the upstream RWKV releases. The C API in rwkv.h is clean and minimal, which makes bindings to other languages straightforward.
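To make the flat-memory point concrete, here is a back-of-the-envelope sketch comparing a transformer KV cache with a fixed-size recurrent state. Everything in it is an assumption for illustration: the function names, the model dimensions (24 layers, width 2048), the FP16 cache, and the figure of five FP32 state vectors per layer are hypothetical and are not taken from rwkv.cpp or its README. Actual state sizes depend on the RWKV version.

```python
def kv_cache_bytes(n_tokens, n_layers, d_model, bytes_per_value=2):
    """Transformer KV cache: one key and one value vector per layer per token."""
    return 2 * n_layers * n_tokens * d_model * bytes_per_value


def recurrent_state_bytes(n_layers, d_model, vectors_per_layer=5, bytes_per_value=4):
    """RNN-style state: a fixed number of vectors per layer, independent of tokens.

    vectors_per_layer=5 is an illustrative assumption, not a value from the source.
    """
    return n_layers * vectors_per_layer * d_model * bytes_per_value


if __name__ == "__main__":
    n_layers, d_model = 24, 2048  # hypothetical model shape
    state_mib = recurrent_state_bytes(n_layers, d_model) / 2**20
    for n_tokens in (1_024, 8_192, 65_536):
        cache_mib = kv_cache_bytes(n_tokens, n_layers, d_model) / 2**20
        print(f"{n_tokens:>6} tokens: KV cache {cache_mib:8.1f} MiB, "
              f"recurrent state {state_mib:.2f} MiB")
```

The cache term scales linearly with the number of tokens, while the recurrent term has no token count in it at all, which is the whole argument for RWKV on long contexts.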
The last commit was in March 2025, and the README explicitly has a TODO to update the benchmark table: the performance numbers shown are for old v4 models on a 4-core CPU, not current models or hardware. The project is essentially a fork of an older ggml snapshot rather than tracking the ggml that ships with upstream llama.cpp, which means it won't get AVX-512, ARM NEON, or Metal improvements for free. Only two language bindings exist (Go and Node.js), and both are third-party with no guarantee that they track this repo's API. The Python wrapper requires manual model conversion from PyTorch before you can do anything, which adds friction compared to projects that accept GGUF directly.