vllm-project/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
Implements PagedAttention for efficient KV cache management and continuous request batching to maximize GPU utilization. Supports multiple quantization schemes (GPTQ, AWQ, INT4/8, FP8), speculative decoding, and tensor/pipeline parallelism across NVIDIA, AMD, Intel, and TPU hardware. Provides OpenAI-compatible API endpoints and integrates directly with Hugging Face models, including multi-modal and mixture-of-experts architectures.
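The core idea behind PagedAttention is to store a sequence's KV cache in fixed-size blocks drawn from a shared pool, addressed through a per-sequence block table, so memory need not be contiguous or over-reserved. A toy sketch of that idea (not vLLM's actual implementation; the block size and allocator here are illustrative):

```python
# Toy sketch of the paged KV-cache idea behind PagedAttention.
# Illustrative only: block size, allocator, and class names are assumptions,
# not vLLM internals.
BLOCK_SIZE = 4  # tokens stored per KV-cache block


class BlockAllocator:
    """Hands out fixed-size blocks from a shared pool, like a page table."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def alloc(self) -> int:
        return self.free.pop()

    def release(self, blocks: list[int]) -> None:
        # Returning blocks to the pool lets other sequences reuse them,
        # which is what enables high batch occupancy.
        self.free.extend(blocks)


class Sequence:
    """Maps a growing token sequence onto non-contiguous cache blocks."""

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.num_tokens = 0
        self.block_table = []  # logical block index -> physical block id

    def append_token(self) -> None:
        if self.num_tokens % BLOCK_SIZE == 0:  # current block is full
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1


allocator = BlockAllocator(num_blocks=8)
seq = Sequence(allocator)
for _ in range(6):  # 6 tokens with BLOCK_SIZE=4 -> ceil(6/4) = 2 blocks
    seq.append_token()
print(len(seq.block_table))  # 2
```

Because blocks are allocated on demand rather than reserved up front for a maximum sequence length, many more concurrent sequences fit in the same GPU memory, which is what continuous batching exploits.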
73,007 stars and 7,953,905 monthly downloads. Used by 43 other packages. Actively maintained with 996 commits in the last 30 days. Available on PyPI.
Stars
73,007
Forks
14,312
Language
Python
License
Apache-2.0
Last pushed
Mar 13, 2026
Monthly downloads
7,953,905
Commits (30d)
996
Dependencies
68
Reverse dependents
43
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/vllm-project/vllm"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
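The same endpoint can be called from Python. A minimal sketch, assuming only that the URL pattern matches the curl example above (the shape of the returned JSON is not documented here, so it is left as a plain dict):

```python
# Sketch of calling the stats endpoint from Python instead of curl.
# The URL pattern is taken from the curl example on this page; the
# response schema is an assumption and is not inspected here.
import json
from urllib.request import urlopen

API_BASE = "https://pt-edge.onrender.com/api/v1/quality/transformers"


def stats_url(owner: str, repo: str) -> str:
    """Build the per-repository stats URL."""
    return f"{API_BASE}/{owner}/{repo}"


def fetch_stats(owner: str, repo: str) -> dict:
    """Fetch the stats JSON (no API key: limited to 100 requests/day)."""
    with urlopen(stats_url(owner, repo)) as resp:
        return json.load(resp)


print(stats_url("vllm-project", "vllm"))
# https://pt-edge.onrender.com/api/v1/quality/transformers/vllm-project/vllm
```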
Related projects
sgl-project/sglang
SGLang is a high-performance serving framework for large language models and multimodal models.
alibaba/MNN
MNN: A blazing-fast, lightweight inference engine battle-tested by Alibaba, powering...
xorbitsai/inference
Swap GPT for any LLM by changing a single line of code. Xinference lets you run open-source,...
tensorzero/tensorzero
TensorZero is an open-source stack for industrial-grade LLM applications. It unifies an LLM...
ARahim3/mlx-tune
Bringing the Unsloth experience to Mac users via Apple's MLX framework