LLM Inference Engines for Transformer Models
Optimized inference engines and serving systems for deploying and running large language models efficiently. Focuses on throughput, latency, memory optimization, and production deployment. Does NOT include training frameworks, fine-tuning methods, quantization techniques, or model architecture implementations.
There are 153 LLM inference engine projects tracked; 7 score above 70 (verified tier). The highest-rated is vllm-project/vllm at 100/100, with 73,007 stars and 7,953,905 monthly downloads. All 10 of the top 10 are actively maintained.
Get all 153 projects as JSON (the example below requests the first 20; raise `limit` to fetch them all):

```shell
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=transformers&subcategory=llm-inference-engines&limit=20"
```
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
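The endpoint can also be consumed from a script. Below is a minimal sketch using only the Python standard library; the response schema (either a bare JSON list of project records, or an envelope object with a `results` key) is an assumption, so inspect the actual payload before relying on these field names:

```python
import json
import urllib.request

# Query string mirrors the curl example above.
API_URL = (
    "https://pt-edge.onrender.com/api/v1/datasets/quality"
    "?domain=transformers&subcategory=llm-inference-engines&limit=20"
)

def extract_projects(payload):
    """Return project records from either a bare JSON list or an
    envelope object with a "results" key (the schema is an assumption)."""
    if isinstance(payload, list):
        return payload
    return payload.get("results", [])

def fetch_projects(url=API_URL):
    """Perform one unauthenticated request (counts against the
    100-requests/day free quota) and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return extract_projects(json.load(resp))
```

Calling `fetch_projects()` performs one live request; `extract_projects` is split out so the decoding logic can be tested offline.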
| # | Model | Description | Score | Tier |
|---|---|---|---|---|
| 1 | vllm-project/vllm | A high-throughput and memory-efficient inference and serving engine for LLMs | | Verified |
| 2 | sgl-project/sglang | SGLang is a high-performance serving framework for large language models and... | | Verified |
| 3 | alibaba/MNN | MNN: A blazing-fast, lightweight inference engine battle-tested by Alibaba,... | | Verified |
| 4 | xorbitsai/inference | Swap GPT for any LLM by changing a single line of code. Xinference lets you... | | Verified |
| 5 | tensorzero/tensorzero | TensorZero is an open-source stack for industrial-grade LLM applications. It... | | Verified |
| 6 | ARahim3/mlx-tune | Bringing the Unsloth experience to Mac users via Apple's MLX framework | | Verified |
| 7 | gpustack/gpustack | Performance-optimized AI inference on your GPUs. Unlock superior throughput... | | Verified |
| 8 | tenstorrent/tt-metal | :metal: TT-NN operator library, and TT-Metalium low-level kernel programming model. | | Established |
| 9 | InternLM/lmdeploy | LMDeploy is a toolkit for compressing, deploying, and serving LLMs. | | Established |
| 10 | ModelTC/LightLLM | LightLLM is a Python-based LLM (Large Language Model) inference and serving... | | Established |
| 11 | jd-opensource/xllm | A high-performance inference engine for LLMs, optimized for diverse AI accelerators. | | Established |
| 12 | alibaba/rtp-llm | RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications. | | Established |
| 13 | bigscience-workshop/petals | 🌸 Run LLMs at home, BitTorrent-style. Fine-tuning and inference up to 10x... | | Established |
| 14 | FastFlowLM/FastFlowLM | Run LLMs on AMD Ryzen™ AI NPUs in minutes. Just like Ollama - but... | | Established |
| 15 | zhihu/ZhiLight | A highly optimized LLM inference acceleration engine for Llama and its variants. | | Established |
| 16 | NexaAI/nexa-sdk | Run frontier LLMs and VLMs with day-0 model support across GPU, NPU, and... | | Established |
| 17 | NVIDIA-NeMo/Automodel | PyTorch Distributed native training library for LLMs/VLMs with OOTB Hugging... | | Established |
| 18 | Tiiny-AI/PowerInfer | High-speed Large Language Model serving for local deployment | | Established |
| 19 | underneathall/pinferencia | Python + Inference - Model deployment library in Python. Simplest model... | | Established |
| 20 | GeeeekExplorer/nano-vllm | Nano vLLM | | Established |
| 21 | ai-decentralized/BloomBee | Decentralized LLM fine-tuning and inference with offloading | | Established |
| 22 | higgsfield-ai/higgsfield | Fault-tolerant, highly scalable GPU orchestration, and a machine learning... | | Emerging |
| 23 | intel/ipex-llm | Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM,... | | Emerging |
| 24 | AI-Hypercomputer/JetStream | JetStream is a throughput- and memory-optimized engine for LLM inference on... | | Emerging |
| 25 | toverainc/willow-inference-server | Open-source, local, and self-hosted highly optimized language inference... | | Emerging |
| 26 | microsoft/sarathi-serve | A low-latency & high-throughput serving engine for LLMs | | Emerging |
| 27 | alibaba/InferSim | A lightweight LLM inference performance simulator | | Emerging |
| 28 | slwang-ustc/nano-vllm-v1 | Nano vLLM with vLLM v1's request scheduling strategy and chunked prefill | | Emerging |
| 29 | livepeer/ai-runner | Inference runtime for running different batch and real-time AI pipelines. | | Emerging |
| 30 | Deep-Spark/DeepSparkInference | DeepSparkInference has selected 216 inference models of both small and large... | | Emerging |
| 31 | zhenye234/LLaSA_training | LLaSA: Scaling Train-time and Inference-time Compute for LLaMA-based Speech Synthesis | | Emerging |
| 32 | microsoft/vidur | A large-scale simulation framework for LLM inference | | Emerging |
| 33 | inclusionAI/asystem-awex | A high-performance RL training-inference weight synchronization framework,... | | Emerging |
| 34 | kennethleungty/Llama-2-Open-Source-LLM-CPU-Inference | Running Llama 2 and other open-source LLMs on CPU inference locally for document Q&A | | Emerging |
| 35 | vitoplantamura/OnnxStream | Lightweight inference library for ONNX files, written in C++. It can run... | | Emerging |
| 36 | jina-ai/rungpt | An open-source, cloud-native serving framework for large multi-modal models (LMMs). | | Emerging |
| 37 | Troyanovsky/Local-LLM-Comparison-Colab-UI | Compare the performance of different LLMs that can be deployed locally on... | | Emerging |
| 38 | PureBee/purebee | A GPU defined in software. Runs Llama 3.2 1B at 3.6 tok/sec. Zero dependencies. | | Emerging |
| 39 | SearchSavior/OpenArc | Inference engine for Intel devices. Serve LLMs, VLMs, Whisper, Kokoro-TTS,... | | Emerging |
| 40 | vectorch-ai/ScaleLLM | A high-performance inference system for large language models, designed for... | | Emerging |
| 41 | bytedance/byteir | A model compilation solution for various hardware | | Emerging |
| 42 | MegEngine/InferLLM | A lightweight LLM model inference framework | | Emerging |
| 43 | RWKV/rwkv.cpp | INT4/INT5/INT8 and FP16 inference on CPU for the RWKV language model | | Emerging |
| 44 | zejia-lin/BulletServe | Boosting GPU utilization for LLM serving via dynamic spatial-temporal... | | Emerging |
| 45 | AI-Hypercomputer/jetstream-pytorch | PyTorch/XLA integration with JetStream (https://github.com/google/JetStream)... | | Emerging |
| 46 | andrewkchan/deepseek.cpp | CPU inference for the DeepSeek family of large language models in C++ | | Emerging |
| 47 | powerserve-project/PowerServe | High-speed and easy-to-use LLM serving framework for local deployment | | Emerging |
| 48 | 1b5d/llm-api | Run any large language model behind a unified API | | Emerging |
| 49 | interestingLSY/swiftLLM | A tiny yet powerful LLM inference system tailored for research purposes.... | | Emerging |
| 50 | SqueezeAILab/LLMCompiler | [ICML 2024] LLMCompiler: An LLM Compiler for Parallel Function Calling | | Emerging |
| 51 | chenmozhijin/BSRoformer.cpp | GGML-based C++ inference for BS Roformer/Mel-Band-Roformer vocal separation... | | Emerging |
| 52 | modelscope/dash-infer | DashInfer is a native LLM inference engine aiming to deliver... | | Emerging |
| 53 | invergent-ai/surogate | Insanely fast LLM pre-training and fine-tuning for modern NVIDIA GPUs.... | | Emerging |
| 54 | jdaln/dgx-spark-inference-stack | Serve the home! Inference stack for your Nvidia DGX Spark, aka the Grace... | | Emerging |
| 55 | vivy-yi/awesome-llm-training-inference | Curated list of LLM training and inference frameworks, tools, and resources.... | | Emerging |
| 56 | toyaix/TritonLLM | LLM inference via Triton (flexible & modular): focused on kernel... | | Emerging |
| 57 | Azure99/BlossomData | A fluent, scalable, and easy-to-use LLM data processing framework. | | Emerging |
| 58 | jankais3r/LLaMA_MPS | Run LLaMA (and Stanford Alpaca) inference on Apple Silicon GPUs. | | Emerging |
| 59 | TrevTron/indiedroid-nova-llm | Running Llama 3.1 8B and other LLMs on the RK3588 NPU - benchmarks and setup guides | | Emerging |
| 60 | thruthseeker/LionLock_FDE_OSS | Open-source fatigue detection engine for large language models with trust overlay | | Emerging |
| 61 | aniketmaurya/llm-inference | Large Language Model (LLM) inference API and chatbot | | Emerging |
| 62 | hpcaitech/SwiftInfer | Efficient AI inference & serving | | Emerging |
| 63 | MrYxJ/calculate-flops.pytorch | calflops is designed to calculate FLOPs, MACs, and parameters in all... | | Emerging |
| 64 | nareshis21/Truelarge-RT | Android inference engine running 20B+ parameter LLMs on 4GB-8GB RAM devices.... | | Emerging |
| 65 | riccardomusmeci/mlx-llm | Large language model (LLM) applications and tools running on Apple Silicon... | | Emerging |
| 66 | James-QiuHaoran/LLM-serving-with-proxy-models | Efficient interactive LLM serving with proxy-model-based sequence length... | | Emerging |
| 67 | efeslab/Nanoflow | A throughput-oriented, high-performance serving framework for LLMs | | Emerging |
| 68 | argonne-lcf/LLM-Inference-Bench | LLM-Inference-Bench | | Emerging |
| 69 | AmpereComputingAI/llama.cpp | Ampere-optimized llama.cpp | | Emerging |
| 70 | CoderLSF/fast-llama | Runs LLaMA at extremely high speed | | Emerging |
| 71 | andrewkchan/yalm | Yet Another Language Model: LLM inference in C++/CUDA, no libraries except for I/O | | Emerging |
| 72 | tommasocerruti/detllm | Deterministic-mode checks for LLM inference: measure run/batch variance,... | | Emerging |
| 73 | knagrecha/saturn | Saturn accelerates the training of large-scale deep learning models with a... | | Emerging |
| 74 | zRzRzRzRzRzRzR/lm-fly | LLM inference framework acceleration: make LLMs fly | | Emerging |
| 75 | rbitr/llm.f90 | LLM inference in Fortran | | Emerging |
| 76 | yingding/applyllm | A Python package for applying LLMs with LangChain and Hugging Face on local... | | Emerging |
| 77 | ShinoharaHare/LLM-Training | A distributed training framework for large language models powered by Lightning. | | Emerging |
| 78 | gunnarnordqvist/opencode-context-filter | Transparent HTTP proxy that automatically filters repository context for... | | Experimental |
| 79 | gotzmann/booster | Booster - open accelerator for LLM models. Better inference and debugging... | | Experimental |
| 80 | AshishGautamX/K8s-LLM-Scheduler | An intelligent Kubernetes scheduler powered by Meta's Llama-3.3-70B model... | | Experimental |
| 81 | psmarter/mini-infer | A high-performance LLM inference engine with PagedAttention \|... | | Experimental |
| 82 | moeru-ai/demodel | 🚀🛸 Easily boost the speed of pulling your models and datasets from various... | | Experimental |
| 83 | m0dulo/InferSpore | 🌱 A fully independent Large Language Model (LLM) inference engine, built... | | Experimental |
| 84 | m-horky/sllm | Tools using small large language models | | Experimental |
| 85 | lucasjinreal/Namo-R1 | A CPU real-time VLM in 500M. Surpassed Moondream2 and SmolVLM. Training from... | | Experimental |
| 86 | KarthikSriramGit/H.E.I.M.D.A.L.L | H.E.I.M.D.A.L.L looks at fleet telemetry and gives you natural-language... | | Experimental |
| 87 | alibaba/easydist | Automated parallelization system and infrastructure for multiple ecosystems | | Experimental |
| 88 | winstxnhdw/llm-api | A fast CPU-based API for Qwen 2.5 using CTranslate2, hosted on Hugging Face Spaces. | | Experimental |
| 89 | jmaczan/tiny-vllm | High-performance LLM inference engine, a younger sibling of vLLM | | Experimental |
| 90 | RahulSChand/gpu_poor | Calculate tokens/s & GPU memory requirements for any LLM. Supports... | | Experimental |
| 91 | dengls24/LLM-para | Analyze LLM inference: FLOPs, memory, Roofline model. Supports GQA, MoE,... | | Experimental |
| 92 | BenChaliah/NVFP4-on-4090-vLLM | AdaLLM is an NVFP4-first inference runtime for Ada Lovelace (RTX 4090) with... | | Experimental |
| 93 | ToddThomson/Mila | Achilles Mila Deep Neural Network library provides a comprehensive API to... | | Experimental |
| 94 | HyperMink/inferenceable | Scalable AI inference server for CPU and GPU with Node.js \| Utilizes... | | Experimental |
| 95 | ybubnov/metalchat | Pure C++23 Llama inference for Apple Silicon chips | | Experimental |
| 96 | kennethleungty/DeepSeek-R1-Ollama-Simple-Evals | Run and evaluate DeepSeek-R1 distilled models locally with Ollama and... | | Experimental |
| 97 | harleyszhang/llm_counts | LLM theoretical performance analysis tools supporting params, FLOPs, memory... | | Experimental |
| 98 | titanml/takeoff-community | TitanML Takeoff Server is an optimization, compression, and deployment... | | Experimental |
| 99 | bpevangelista/vfastml | Inference and training engine for LLMs, Image2Image, and other models | | Experimental |
| 100 | Relaxed-System-Lab/HexGen | [ICML 2024] Serving LLMs on heterogeneous decentralized clusters. | | Experimental |
| 101 | KevinLee1110/dynamic-batching | The official repo for the paper "Optimizing LLM Inference Throughput via... | | Experimental |
| 102 | mjglatzmaier/llm-boostrap | Starter repo for running local LLM inference and lightweight benchmarking on... | | Experimental |
| 103 | HelpingAI/inferno | Run Llama 3.3, DeepSeek-R1, Phi-4, Gemma 3, Mistral Small 3.1, and other... | | Experimental |
| 104 | quantumnic/ssd-llm | Run 70B+ LLMs on Apple Silicon by using the SSD as extended memory: intelligent... | | Experimental |
| 105 | llm-works/llm-infer | LLM inference server with native, vLLM, and Ollama backends, including a... | | Experimental |
| 106 | VPanjeta/PyLLaMa-CPU | Fast LLaMA inference on CPU using llama.cpp for Python | | Experimental |
| 107 | deepagency/llm-resource-planner | A simple CLI tool to fetch Hugging Face model metadata and estimate required... | | Experimental |
| 108 | TeamADAPT/blitzkernels | BlitzKernels: production WASM inference kernels for edge AI (embedding,... | | Experimental |
| 109 | onlychara553-debug/dgx-spark-inference-stack | 🚀 Serve large language models efficiently at home with this Docker-based... | | Experimental |
| 110 | liam8421/faster-llm | 🚀 Accelerate LLM training with Fast-LLM, an open-source library for... | | Experimental |
| 111 | MonitooDev/indiedroid-nova-llm | 🚀 Benchmark local LLMs like Llama 3.1 on the Indiedroid Nova with RK3588... | | Experimental |
| 112 | changwoolee/BLAST | [NeurIPS 2024] BLAST: Block-Level Adaptive Structured Matrix for Efficient... | | Experimental |
| 113 | modelize-ai/LLM-Inference-Deployment-Tutorial | Tutorial for LLM developers about engine design, service deployment,... | | Experimental |
| 114 | rafaelmaza/llmfit-web | Find the best open-source LLM for your GPU/RAM - fit, speed & quality... | | Experimental |
| 115 | AntonioVFranco/elamonica | Production-ready test-time compute optimization framework for LLM inference.... | | Experimental |
| 116 | CornelisKuijpers/SIP-interface | Run 400B+ parameter AI models on consumer hardware with 12GB RAM | | Experimental |
| 117 | landry-some/LLM-streaming | Efficient streaming inference for large language models (LLMs). | | Experimental |
| 118 | darxkies/cpu-slm | A holiday project to better understand the inner workings of SLMs/LLMs. | | Experimental |
| 119 | johnbrodowski/AutoInferenceBenchmark | AutoInferenceBenchmark is a Windows desktop application for evaluating and... | | Experimental |
| 120 | Artemarius/CuInfer | From-scratch LLM inference engine in C++17/CUDA. Custom kernels, GGUF model... | | Experimental |
| 121 | ThalesMMS/sglang-config | Configuration files and deployment scripts for serving Llama 3.2 3B and Qwen... | | Experimental |
| 122 | EmbeddedLLM/embeddedllm | EmbeddedLLM: API server for embedded device deployment. Currently supports... | | Experimental |
| 123 | piotrmaciejbednarski/llm-inference-tampering | Proof-of-concept for persistent manipulation of LLM outputs by modifying... | | Experimental |
| 124 | datvodinh/serve-llm | Serve high-throughput, scalable LLMs using Ray and vLLM | | Experimental |
| 125 | tensorchord/inference-benchmark | Benchmark for machine learning model online serving (LLM, embedding,... | | Experimental |
| 126 | GPUforLLM/llm-vram-calculator | Accurate VRAM calculator for local LLMs (Llama 4, DeepSeek V3, Qwen 2.5).... | | Experimental |
| 127 | nitrictech/pycasts | A text-to-podcast inference API | | Experimental |
| 128 | Meahg/exvllm | 🚀 Enhance vLLM with exvllm to utilize MoE mixed inference, enabling... | | Experimental |
| 129 | ictnlp/SiLLM | SiLLM is a Simultaneous Machine Translation (SiMT) framework. It utilizes a... | | Experimental |
| 130 | isshiki-dev/docker-model-runner | Self-hosted Anthropic-API-compatible inference server with Claude Code... | | Experimental |
| 131 | arkodeepsen/helix | Professional training stack for 100M-parameter language models optimized for... | | Experimental |
| 132 | AMD-AGI/gpt-fast | GPT-Fast for multimodal models on AMD GPUs | | Experimental |
| 133 | virtualramblas/DFloat11_MPS | DFloat11 for Apple Silicon. | | Experimental |
| 134 | rajatady/Inference-Stack | Production-grade LLM inference API built from scratch. NestJS gateway +... | | Experimental |
| 135 | Scieries-Reunies-de-l-Est/llm | LLM deployment API of the Service Commercial company. | | Experimental |
| 136 | 1337hero/rx7900xtx-llama-bench-rocm | Benchmark script for llama.cpp & results for the AMD RX 7900 XTX | | Experimental |
| 137 | SunayHegde2006/Air.rs | Air.rs: 70B+ inference on a consumer GPU; LLM inference in Rust | | Experimental |
| 138 | adamydwang/mobilellama | A lightweight C++ LLaMA inference engine for mobile devices | | Experimental |
| 139 | rick97julho/do-i-have-the-vram | 🔍 Estimate your VRAM needs for Hugging Face models in seconds without... | | Experimental |
| 140 | vishvaRam/Docker-vLLM-Server-Builder-Runpod | Production-grade, OpenAI-compatible server using vLLM v0.17.0. Deploy LLMs,... | | Experimental |
| 141 | joeddav/illustrated-training-cluster | [WIP] Interactive visualization of LLM training parallelism across GPU clusters | | Experimental |
| 142 | iNeil77/vllm-code-harness | Run code-inference-only benchmarks quickly using vLLM | | Experimental |
| 143 | X-rayLaser/DistributedLLM | Run LLM inference by splitting models into parts and hosting each part on a... | | Experimental |
| 144 | rinoScremin/Open_Cluster_AI_Station_beta | High-performance distributed matrix computation for AI workloads. Supports... | | Experimental |
| 145 | getflexai/flex_ai | Simplifies fine-tuning and inference for 60+ open-source LLMs through a single API | | Experimental |
| 146 | eniompw/llama-cpp-gpu | Load larger models by offloading model layers to both GPU and CPU | | Experimental |
| 147 | EvanZhuang/rocm_tips | Tips for building and using DL packages for AMD ROCm | | Experimental |
| 148 | karun2328/llm_serving_benchmarks | Benchmarking LLM inference serving with vLLM, analyzing latency, throughput,... | | Experimental |
| 149 | virtualramblas/FlexLLMGenMPS | Running large language models on a single M1/M2 GPU for throughput-oriented... | | Experimental |
| 150 | ZeeetOne/llm-inference-deployment | Practical example of deploying fine-tuned LLMs locally with FastAPI.... | | Experimental |
| 151 | G-B-KEVIN-ARJUN/runtime-inference | "Faster AI: Accelerating Qwen 2.5 from 7 t/s to 82 t/s on a single RTX 4060... | | Experimental |
| 152 | KT313/assistant_base | A custom framework for easy use of LLMs, VLMs, etc., supporting various modes... | | Experimental |
| 153 | di-osc/osc-llm | A lightweight large language model inference engine | | Experimental |