Zefan-Cai/R-KV

[NeurIPS 2025] R-KV: Redundancy-aware KV Cache Compression for Reasoning Models

Score: 46 / 100 (Emerging)

Implements on-the-fly KV cache compression for reasoning models by jointly scoring tokens for importance (via attention weights) and non-redundancy (via key-vector cosine similarity), enabling 90% memory savings at near-full accuracy on math benchmarks. Training-free and framework-agnostic, it ranks tokens during decoding and maintains fixed-size buffers separate from the budgeted cache, achieving a 6.6× throughput gain on long chain-of-thought generations. The method is currently being integrated into production inference stacks, including vLLM, SGLang, and VeRL, with support for DeepSeek-R1 and Qwen models.
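To make the joint scoring concrete, here is a minimal NumPy sketch of the idea described above: importance from accumulated attention mass, redundancy from pairwise cosine similarity between cached key vectors, and a blended top-k selection under the cache budget. The function name rkv_select, the mixing weight lam, and the normalization choices are illustrative assumptions, not the repository's actual implementation.

import numpy as np

def rkv_select(keys, attn_mass, budget, lam=0.5):
    """Toy joint scoring in the spirit of R-KV (illustrative, not the repo's code).

    keys      : (T, d) array of cached key vectors
    attn_mass : (T,) attention each cached token received from recent
                queries, e.g. summed over heads (an assumed proxy)
    budget    : number of tokens to retain in the compressed cache
    lam       : importance vs. non-redundancy trade-off (assumed parameter)
    """
    # Importance: scale accumulated attention into [0, 1].
    importance = attn_mass / (attn_mass.max() + 1e-8)

    # Redundancy: a token whose key is nearly parallel to another cached
    # key adds little new information, so penalize its max cosine similarity.
    unit = keys / (np.linalg.norm(keys, axis=1, keepdims=True) + 1e-8)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -1.0)           # ignore self-similarity
    redundancy = sim.max(axis=1)          # per-token overlap, in [-1, 1]

    # Blend: keep tokens that are both attended to and non-redundant
    # (the redundancy term is rescaled from [-1, 1] into [0, 1]).
    score = lam * importance + (1.0 - lam) * (1.0 - redundancy) / 2.0

    # Evict everything outside the top-`budget` scores.
    return np.sort(np.argsort(score)[-budget:])

# Example: retain ~10% of a 256-token cache, mirroring the 90% memory
# savings figure quoted above.
rng = np.random.default_rng(0)
kept = rkv_select(rng.normal(size=(256, 64)), rng.random(256), budget=26)
print(f"kept {kept.size}/256 tokens")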

No license · No package · No dependents
Maintenance: 6 / 25
Adoption: 10 / 25
Maturity: 7 / 25
Community: 23 / 25

Stars: 1,183
Forks: 190
Language: Python
License: none
Last pushed: Oct 16, 2025
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/embeddings/Zefan-Cai/R-KV"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
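For scripted access, the same endpoint can be queried from Python with the standard library alone. This sketch assumes the endpoint returns a JSON body and simply pretty-prints whatever comes back, since the response schema is not documented on this page.

import json
import urllib.request

# Endpoint shown in the curl example above; no API key is needed at the
# free tier (100 requests/day).
URL = "https://pt-edge.onrender.com/api/v1/quality/embeddings/Zefan-Cai/R-KV"

with urllib.request.urlopen(URL, timeout=10) as resp:
    data = json.load(resp)   # assumes a JSON response body

print(json.dumps(data, indent=2, ensure_ascii=False))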