LLM Inference Engines
High-performance inference frameworks and engines optimized for deploying and serving LLMs efficiently across various hardware accelerators and resource-constrained devices. Does NOT include LLM training frameworks, fine-tuning tools, or application-level chatbot/UI wrappers.
There are 35 LLM inference engine tools tracked; 1 scores above 70 (Verified tier). The highest-rated is kvcache-ai/Mooncake at 72/100 with 4,911 stars. 6 of the top 10 are actively maintained.
Get all 35 projects as JSON
```shell
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=llm-inference-engines&limit=20"
```
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
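Once fetched, the JSON can be filtered client-side. A minimal sketch of that step, assuming a hypothetical response shape (the field names `results`, `name`, `score`, and `tier`, and every score except Mooncake's 72, are placeholders for illustration, not the endpoint's documented schema):

```python
import json

# Hypothetical response payload -- the real schema of the quality-dataset
# endpoint may differ; the second entry's score is a made-up placeholder.
sample_response = json.loads("""
{
  "domain": "llm-tools",
  "subcategory": "llm-inference-engines",
  "results": [
    {"name": "kvcache-ai/Mooncake", "score": 72, "tier": "Verified"},
    {"name": "vllm-project/vllm-ascend", "score": 55, "tier": "Established"}
  ]
}
""")

def filter_by_score(payload, threshold):
    """Return names of projects scoring at or above the threshold."""
    return [r["name"] for r in payload["results"] if r["score"] >= threshold]

print(filter_by_score(sample_response, 70))  # ['kvcache-ai/Mooncake']
```

The same filter applied at threshold 70 reproduces the "1 score above 70" headline stat for this dataset.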
| # | Tool | Description | Tier |
|---|------|-------------|------|
| 1 | kvcache-ai/Mooncake | Mooncake is the serving platform for Kimi, a leading LLM service provided by... | Verified |
| 2 | vllm-project/vllm-ascend | Community-maintained hardware plugin for vLLM on Ascend | Established |
| 3 | SemiAnalysisAI/InferenceX | Open Source Continuous Inference Benchmarking Qwen3.5, DeepSeek, GPTOSS -... | Established |
| 4 | sophgo/tpu-mlir | Machine learning compiler based on MLIR for Sophgo TPU. | Established |
| 5 | uccl-project/uccl | UCCL is an efficient communication library for GPUs, covering collectives,... | Established |
| 6 | BBuf/how-to-optim-algorithm-in-cuda | How to optimize some algorithms in CUDA. | Established |
| 7 | RightNow-AI/picolm | Run a 1-billion-parameter LLM on a $10 board with 256MB RAM | Established |
| 8 | jinbooooom/ai-infra-hpc | HPC tutorials covering collective communication (MPI, NCCL), CUDA programming, vectorized SIMD, RDMA communication, and more | Emerging |
| 9 | zjhellofss/KuiperLLama | Hands-on project that walks you through implementing, from scratch, an LLM inference framework supporting LLama2/3 and Qwen2.5; well suited for campus-recruiting and internship portfolios | Emerging |
| 10 | RayFernando1337/LLM-Calc | Instantly calculate the maximum size of quantized language models that can... | Emerging |
| 11 | erans/selfhostllm | A web-based calculator for estimating GPU memory requirements and maximum... | Emerging |
| 12 | amirgholami/ai_and_memory_wall | AI and Memory Wall | Emerging |
| 13 | bd4sur/Nano | Electronic Parrot / Toy Language Model | Emerging |
| 14 | ChiefGyk3D/FrankenLLM | Stitched-together GPUs, but it lives! Run different LLM models optimally... | Emerging |
| 15 | FilipFan/PolyEngineInfer | Run LLM inference in an Android app with llama.cpp, ExecuTorch, LiteRT,... | Experimental |
| 16 | PrajwalNeeralagi/nano-vllm | 🚀 Implement fast offline inference with Nano-vLLM, a lightweight and... | Experimental |
| 17 | Alex188dot/GPU-VRAM-Calculator | A simple tool to find out GPU VRAM requirements for running LLMs | Experimental |
| 18 | refinefuture-ai/refft.cpp | A new approach to running LLM/LMs' inference/training on GPU/NPU backends... | Experimental |
| 19 | Jugurthakebaili1/vLLM-Kunlun | 🛠 Enhance vLLM performance on Kunlun XPU with this hardware plugin, offering... | Experimental |
| 20 | manishklach/SRMIC_X1 | Analytical simulator for SRMIC, a residency-first LLM inference accelerator... | Experimental |
| 21 | darekhta/marmot | High-performance LLM inference engine in C23 with CPU and Metal backends,... | Experimental |
| 22 | George614/gpu-mem-calculator | GPU Memory Calculator for LLM Training - Calculate GPU memory requirements... | Experimental |
| 23 | dwain-barnes/LLM-GGUF-Auto-Converter | Automated Jupyter notebook solution for batch converting Large Language... | Experimental |
| 24 | hofong428/Optimizing-GPU-Kernels | LLM Serving & Inference Optimization | Experimental |
| 25 | simar-rekhi/triton | LLM-assisted compiler pass generation with Triton & CUDA | Experimental |
| 26 | NEBUL-AI/HF-VRAM-Extension | VRAM calculator for Hugging Face models | Experimental |
| 27 | r3tr056/loc-ai-ly | Locaily - Making Large Language Model Inference Accessible on Consumer Hardware | Experimental |
| 28 | Pyrolignic-paydirt84/pse-vcipher-collapse | Accelerate LLM inference by collapsing attention paths with... | Experimental |
| 29 | soy-tuber/localllama-insights | Technical insights from r/LocalLLaMA: vLLM, FP8, NVFP4, Blackwell GPU... | Experimental |
| 30 | MetaxisResearch/parallax | Distributed inference across heterogeneous hardware. | Experimental |
| 31 | LessUp/hetero-paged-infer | PagedAttention + Continuous Batching Inference Engine Prototype (Rust):... | Experimental |
| 32 | jbenongftw/gpu-perf-engineering-resources | 🚀 Master GPU kernel programming and optimization for high-performance AI... | Experimental |
| 33 | jman4162/Sizing-AI-Training-by-Cost-per-Memory-Bandwidth | A practical model (with math + Python) to tell if you're compute-, memory-,... | Experimental |
| 34 | elibutters/CascadeInference | Cascade-based inference for LLMs | Experimental |
| 35 | Alexyskoutnev/TurboInference | Welcome to TurboInference, a high-performance inference toolkit written in... | Experimental |