Awesome-KV-Cache-Compression and Awesome-LLM-KV-Cache

These two lists complement each other: Awesome-KV-Cache-Compression curates papers focused specifically on KV cache compression techniques, while Awesome-LLM-KV-Cache collects KV cache research more broadly, with links to corresponding implementations. Together they let researchers explore both specialized compression methods and the wider landscape of KV cache optimizations.

| Metric        | Awesome-KV-Cache-Compression | Awesome-LLM-KV-Cache                |
|---------------|------------------------------|-------------------------------------|
| Maintenance   | 10/25                        | 0/25                                |
| Adoption      | 10/25                        | 10/25                               |
| Maturity      | 16/25                        | 16/25                               |
| Community     | 11/25                        | 13/25                               |
| Stars         | 668                          | 417                                 |
| Forks         | 22                           | 26                                  |
| Downloads     |                              |                                     |
| Commits (30d) | 0                            | 0                                   |
| Language      |                              |                                     |
| License       | MIT                          | GPL-3.0                             |
| Flags         | No Package, No Dependents    | Stale 6m, No Package, No Dependents |

About Awesome-KV-Cache-Compression

October2001/Awesome-KV-Cache-Compression

📰 Must-read papers on KV Cache Compression (constantly updating 🤗).

Curates papers and implementations spanning pruning, quantization, and distillation approaches for reducing KV cache memory consumption in LLMs, with links to referenced codebases such as kvpress and KVCache-Factory. Organizes methods by technique (sparse attention, token eviction, low-rank decomposition) and includes recent survey papers covering KV cache optimization strategies across inference frameworks. The list connects to the Hugging Face transformers ecosystem and tracks active research implementations through their GitHub repositories.
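To make the quantization category concrete, here is a minimal, illustrative sketch of per-channel symmetric KV cache quantization, one of the technique families the list covers. This is not code from either repository; the function names and the 4-bit setting are our own assumptions for illustration.

```python
import numpy as np

def quantize_kv(kv: np.ndarray, bits: int = 4):
    """Per-channel symmetric quantization of a KV tensor (illustrative sketch).

    kv: (num_tokens, head_dim) slice of a KV cache.
    Returns int8-stored codes plus a per-channel scale for dequantization.
    """
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for signed 4-bit
    scale = np.abs(kv).max(axis=-2, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)        # guard against all-zero channels
    q = np.clip(np.round(kv / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize_kv(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Reconstruct an approximate float cache from codes and scales."""
    return q.astype(np.float32) * scale

# Toy cache: 128 cached tokens, head dimension 64.
kv = np.random.randn(128, 64).astype(np.float32)
q, s = quantize_kv(kv, bits=4)
recon = dequantize_kv(q, s)
err = np.abs(kv - recon).max()  # bounded by half a quantization step per channel
```

Real methods in the surveyed papers add refinements (e.g. outlier handling or mixed precision), but the core trade of precision for memory is the same.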

About Awesome-LLM-KV-Cache

Zefan-Cai/Awesome-LLM-KV-Cache

Awesome-LLM-KV-Cache: A curated list of 📙Awesome LLM KV Cache Papers with Codes.

Organizes research papers and implementations across nine specialized KV cache optimization categories—including compression, quantization, low-rank decomposition, and cross-layer utilization—enabling developers to track state-of-the-art inference acceleration techniques. Papers are mapped to their official implementations from research teams at DeepSeek, Microsoft, and others, with implementation links and recommendation ratings. The collection spans foundational work like StreamingLLM through recent advances in sparse attention and disaggregated serving architectures, targeting LLM inference optimization across various hardware and deployment scenarios.
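As a concrete example of the foundational work mentioned, StreamingLLM-style token eviction keeps a few initial "attention sink" tokens plus a sliding window of recent tokens. The sketch below is our own illustration of that retention policy; the function and parameter names are hypothetical, not from the papers or the repository.

```python
def streaming_llm_keep(seq_len: int, num_sinks: int = 4, window: int = 1020) -> list[int]:
    """Indices of KV entries retained under a StreamingLLM-style eviction policy:
    the first `num_sinks` attention-sink tokens plus the `window` most recent tokens.
    (Illustrative sketch; parameter names are assumptions, not from the paper.)"""
    if seq_len <= num_sinks + window:
        return list(range(seq_len))          # nothing to evict yet
    # Keep the sinks and the trailing window; evict everything in between.
    return list(range(num_sinks)) + list(range(seq_len - window, seq_len))

# With 2 sinks and a window of 4, a 10-token cache keeps:
print(streaming_llm_keep(10, num_sinks=2, window=4))  # [0, 1, 6, 7, 8, 9]
```

This bounds the cache at `num_sinks + window` entries regardless of sequence length, which is why the technique matters for the long-context serving scenarios the list tracks.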

Scores updated daily from GitHub, PyPI, and npm data. How scores work