Zefan-Cai/R-KV
[NeurIPS 2025] R-KV: Redundancy-aware KV Cache Compression for Reasoning Models
Implements on-the-fly KV cache compression for reasoning models by jointly scoring tokens for importance (via attention weights) and non-redundancy (via key-vector cosine similarity), enabling 90% memory savings at near-full accuracy on math benchmarks. Training-free and framework-agnostic, it ranks tokens during decoding and maintains fixed-size buffers separate from the budgeted cache, achieving a 6.6× throughput gain on long chain-of-thought generations. The method is currently being integrated into production inference stacks including vLLM, SGLang, and VeRL, with support for DeepSeek-R1 and Qwen models.
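A minimal sketch of the joint scoring idea described above, assuming one attention head and a fixed token budget. Function and parameter names (rkv_select, alpha, budget) are illustrative, not the repository's actual API:

    # Hypothetical R-KV-style token selection: combine importance (attention
    # mass each cached token receives) with non-redundancy (1 minus the max
    # cosine similarity to any other cached key), then keep the top-budget tokens.
    import torch

    def rkv_select(keys: torch.Tensor,
                   attn_weights: torch.Tensor,
                   budget: int,
                   alpha: float = 0.5) -> torch.Tensor:
        """Return indices of the `budget` cached tokens to keep.

        keys:         (seq_len, head_dim) cached key vectors for one head
        attn_weights: (num_queries, seq_len) attention from recent queries
        alpha:        trade-off between importance and non-redundancy
        """
        # Importance: average attention mass each cached token receives.
        importance = attn_weights.mean(dim=0)            # (seq_len,)

        # Redundancy: highest cosine similarity to any *other* cached key.
        k = torch.nn.functional.normalize(keys, dim=-1)
        sim = k @ k.T                                    # (seq_len, seq_len)
        sim.fill_diagonal_(-1.0)                         # ignore self-similarity
        redundancy = sim.max(dim=-1).values              # (seq_len,)

        # Joint score: important AND non-redundant tokens rank highest.
        score = alpha * importance + (1 - alpha) * (1 - redundancy)
        return torch.topk(score, k=min(budget, keys.shape[0])).indices

    # Toy usage: keep 4 of 16 cached tokens.
    keys = torch.randn(16, 64)
    attn = torch.softmax(torch.randn(2, 16), dim=-1)
    print(sorted(rkv_select(keys, attn, budget=4).tolist()))

The paper's actual scoring and buffer management may differ; this only mirrors the two signals named in the summary.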
Stars: 1,183
Forks: 190
Language: Python
License: —
Category: —
Last pushed: Oct 16, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/embeddings/Zefan-Cai/R-KV"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
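For programmatic use, a minimal Python sketch of the same request, assuming the endpoint returns JSON (the response schema is not documented here):

    # Equivalent of the curl call above, using the requests library.
    import requests

    url = "https://pt-edge.onrender.com/api/v1/quality/embeddings/Zefan-Cai/R-KV"
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    print(resp.json())  # assumed JSON payload; schema not specified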