vllm-project/vllm

A high-throughput and memory-efficient inference and serving engine for LLMs

Score: 100 / 100 (Verified)

Implements PagedAttention for efficient KV cache management and continuous request batching to maximize GPU utilization. Supports multiple quantization schemes (GPTQ, AWQ, INT4/8, FP8), speculative decoding, and tensor/pipeline parallelism across NVIDIA, AMD, Intel, and TPU hardware. Provides OpenAI-compatible API endpoints and integrates directly with Hugging Face models, including multi-modal and mixture-of-experts architectures.
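
The features above are easiest to see through vLLM's offline Python API. A minimal sketch, assuming vLLM is installed via pip; the model name is only an example, and any supported Hugging Face model ID works the same way:

from vllm import LLM, SamplingParams

# Prompts are batched continuously on the GPU by the engine.
prompts = ["The capital of France is", "PagedAttention manages the KV cache by"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Loads the model directly from Hugging Face (example model, not a recommendation).
llm = LLM(model="facebook/opt-125m")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)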

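The OpenAI-compatible endpoints mean the stock openai client can talk to a locally served model. A sketch, assuming a server was started with vllm serve facebook/opt-125m on the default port 8000:

from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server; the key is a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.completions.create(
    model="facebook/opt-125m",  # must match the model the server was launched with
    prompt="PagedAttention stores the KV cache in",
    max_tokens=32,
)
print(response.choices[0].text)
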
73,007 stars and 7,953,905 monthly downloads. Used by 43 other packages. Actively maintained with 996 commits in the last 30 days. Available on PyPI.

Maintenance: 25 / 25
Adoption: 25 / 25
Maturity: 25 / 25
Community: 25 / 25

Stars: 73,007
Forks: 14,312
Language: Python
License: Apache-2.0
Last pushed: Mar 13, 2026
Monthly downloads: 7,953,905
Commits (30d): 996
Dependencies: 68
Reverse dependents: 43

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/vllm/vllm-project/vllm"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
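
For programmatic access, the same endpoint works from Python. A minimal sketch using requests, assuming the anonymous 100 requests/day tier and a JSON response body:

import requests

# Same endpoint as the curl example above; anonymous access is rate-limited.
url = "https://pt-edge.onrender.com/api/v1/quality/vllm/vllm-project/vllm"
resp = requests.get(url, timeout=10)
resp.raise_for_status()
print(resp.json())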