Zefan-Cai/R-KV
[NeurIPS 2025] R-KV: Redundancy-aware KV Cache Compression for Reasoning Models
Implements on-the-fly KV cache compression for reasoning models by jointly scoring tokens for importance (via attention weights) and non-redundancy (via key-vector cosine similarity), enabling 90% memory savings at near-full accuracy on math benchmarks. Training-free and framework-agnostic, it ranks tokens during decoding and maintains fixed-size buffers separate from the budgeted cache, achieving a 6.6× throughput gain on long chain-of-thought generations. The method is currently being integrated into production inference stacks including vLLM, SGLang, and VeRL, with support for DeepSeek-R1 and Qwen models.
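A minimal sketch of the joint scoring idea described above, assuming one attention head and a fixed token budget. Function and parameter names (rkv_select, alpha, budget) are illustrative, not the repository's actual API:

    # Hypothetical R-KV-style token selection: combine importance (attention
    # mass each cached token receives) with non-redundancy (1 minus the max
    # cosine similarity to any other cached key), then keep the top-budget tokens.
    import torch

    def rkv_select(keys: torch.Tensor,
                   attn_weights: torch.Tensor,
                   budget: int,
                   alpha: float = 0.5) -> torch.Tensor:
        """Return indices of the `budget` cached tokens to keep.

        keys:         (seq_len, head_dim) cached key vectors for one head
        attn_weights: (num_queries, seq_len) attention from recent queries
        alpha:        trade-off between importance and non-redundancy
        """
        # Importance: average attention mass each cached token receives.
        importance = attn_weights.mean(dim=0)            # (seq_len,)

        # Redundancy: highest cosine similarity to any *other* cached key.
        k = torch.nn.functional.normalize(keys, dim=-1)
        sim = k @ k.T                                    # (seq_len, seq_len)
        sim.fill_diagonal_(-1.0)                         # ignore self-similarity
        redundancy = sim.max(dim=-1).values              # (seq_len,)

        # Joint score: important AND non-redundant tokens rank highest.
        score = alpha * importance + (1 - alpha) * (1 - redundancy)
        return torch.topk(score, k=min(budget, keys.shape[0])).indices

    # Toy usage: keep 4 of 16 cached tokens.
    keys = torch.randn(16, 64)
    attn = torch.softmax(torch.randn(2, 16), dim=-1)
    print(sorted(rkv_select(keys, attn, budget=4).tolist()))

The paper's actual scoring and buffer management may differ; this only mirrors the two signals named in the summary.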
Stars: 1,183
Forks: 190
Language: Python
License: —
Category: —
Last pushed: Oct 16, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/embeddings/Zefan-Cai/R-KV"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
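For programmatic use, a minimal Python sketch of the same request, assuming the endpoint returns JSON (the response schema is not documented here):

    # Equivalent of the curl call above, using the requests library.
    import requests

    url = "https://pt-edge.onrender.com/api/v1/quality/embeddings/Zefan-Cai/R-KV"
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    print(resp.json())  # assumed JSON payload; schema not specified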