LLM Inference Engines
High-performance inference frameworks and engines optimized for deploying and serving LLMs efficiently across various hardware accelerators and resource-constrained devices. Does NOT include LLM training frameworks, fine-tuning tools, or application-level chatbot/UI wrappers.
There are 35 LLM inference engine tools tracked; 1 scores above 70 (Verified tier). The highest-rated is kvcache-ai/Mooncake at 72/100 with 4,911 stars. 6 of the top 10 are actively maintained.
Get all 35 projects as JSON
```shell
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=llm-inference-engines&limit=20"
```
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
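Once fetched, the JSON can be filtered client-side. A minimal sketch of that step, assuming a hypothetical response shape (the field names `results`, `name`, `score`, and `tier`, and every score except Mooncake's 72, are placeholders for illustration, not the endpoint's documented schema):

```python
import json

# Hypothetical response payload -- the real schema of the quality-dataset
# endpoint may differ; the second entry's score is a made-up placeholder.
sample_response = json.loads("""
{
  "domain": "llm-tools",
  "subcategory": "llm-inference-engines",
  "results": [
    {"name": "kvcache-ai/Mooncake", "score": 72, "tier": "Verified"},
    {"name": "vllm-project/vllm-ascend", "score": 55, "tier": "Established"}
  ]
}
""")

def filter_by_score(payload, threshold):
    """Return names of projects scoring at or above the threshold."""
    return [r["name"] for r in payload["results"] if r["score"] >= threshold]

print(filter_by_score(sample_response, 70))  # ['kvcache-ai/Mooncake']
```

The same filter applied at threshold 70 reproduces the "1 score above 70" headline stat for this dataset.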
| # | Tool | Description | Tier |
|---|------|-------------|------|
| 1 | kvcache-ai/Mooncake | Mooncake is the serving platform for Kimi, a leading LLM service provided by... | Verified |
| 2 | vllm-project/vllm-ascend | Community-maintained hardware plugin for vLLM on Ascend | Established |
| 3 | SemiAnalysisAI/InferenceX | Open Source Continuous Inference Benchmarking Qwen3.5, DeepSeek, GPTOSS -... | Established |
| 4 | sophgo/tpu-mlir | Machine learning compiler based on MLIR for Sophgo TPU. | Established |
| 5 | uccl-project/uccl | UCCL is an efficient communication library for GPUs, covering collectives,... | Established |
| 6 | BBuf/how-to-optim-algorithm-in-cuda | How to optimize some algorithms in CUDA. | Established |
| 7 | RightNow-AI/picolm | Run a 1-billion-parameter LLM on a $10 board with 256MB RAM | Established |
| 8 | jinbooooom/ai-infra-hpc | HPC tutorials covering collective communication (MPI, NCCL), CUDA programming, vectorized SIMD, RDMA communication, and more | Emerging |
| 9 | zjhellofss/KuiperLLama | Hands-on project that walks you through implementing, from scratch, an LLM inference framework supporting LLama2/3 and Qwen2.5; well suited for campus-recruiting and internship portfolios | Emerging |
| 10 | RayFernando1337/LLM-Calc | Instantly calculate the maximum size of quantized language models that can... | Emerging |
| 11 | erans/selfhostllm | A web-based calculator for estimating GPU memory requirements and maximum... | Emerging |
| 12 | amirgholami/ai_and_memory_wall | AI and Memory Wall | Emerging |
| 13 | bd4sur/Nano | Electronic Parrot / Toy Language Model | Emerging |
| 14 | ChiefGyk3D/FrankenLLM | Stitched-together GPUs, but it lives! Run different LLM models optimally... | Emerging |
| 15 | FilipFan/PolyEngineInfer | Run LLM inference in an Android app with llama.cpp, ExecuTorch, LiteRT,... | Experimental |
| 16 | PrajwalNeeralagi/nano-vllm | 🚀 Implement fast offline inference with Nano-vLLM, a lightweight and... | Experimental |
| 17 | Alex188dot/GPU-VRAM-Calculator | A simple tool to find out GPU VRAM requirements for running LLMs | Experimental |
| 18 | refinefuture-ai/refft.cpp | A new approach to running LLM/LMs' inference/training on GPU/NPU backends... | Experimental |
| 19 | Jugurthakebaili1/vLLM-Kunlun | 🛠 Enhance vLLM performance on Kunlun XPU with this hardware plugin, offering... | Experimental |
| 20 | manishklach/SRMIC_X1 | Analytical simulator for SRMIC, a residency-first LLM inference accelerator... | Experimental |
| 21 | darekhta/marmot | High-performance LLM inference engine in C23 with CPU and Metal backends,... | Experimental |
| 22 | George614/gpu-mem-calculator | GPU Memory Calculator for LLM Training - Calculate GPU memory requirements... | Experimental |
| 23 | dwain-barnes/LLM-GGUF-Auto-Converter | Automated Jupyter notebook solution for batch converting Large Language... | Experimental |
| 24 | hofong428/Optimizing-GPU-Kernels | LLM Serving & Inference Optimization | Experimental |
| 25 | simar-rekhi/triton | LLM-assisted compiler pass generation with Triton & CUDA | Experimental |
| 26 | NEBUL-AI/HF-VRAM-Extension | VRAM calculator for Hugging Face models | Experimental |
| 27 | r3tr056/loc-ai-ly | Locaily - Making Large Language Model Inference Accessible on Consumer Hardware | Experimental |
| 28 | Pyrolignic-paydirt84/pse-vcipher-collapse | Accelerate LLM inference by collapsing attention paths with... | Experimental |
| 29 | soy-tuber/localllama-insights | Technical insights from r/LocalLLaMA: vLLM, FP8, NVFP4, Blackwell GPU... | Experimental |
| 30 | MetaxisResearch/parallax | Distributed inference across heterogeneous hardware. | Experimental |
| 31 | LessUp/hetero-paged-infer | PagedAttention + Continuous Batching Inference Engine Prototype (Rust):... | Experimental |
| 32 | jbenongftw/gpu-perf-engineering-resources | 🚀 Master GPU kernel programming and optimization for high-performance AI... | Experimental |
| 33 | jman4162/Sizing-AI-Training-by-Cost-per-Memory-Bandwidth | A practical model (with math + Python) to tell if you're compute-, memory-,... | Experimental |
| 34 | elibutters/CascadeInference | Cascade-based inference for LLMs | Experimental |
| 35 | Alexyskoutnev/TurboInference | Welcome to TurboInference, a high-performance inference toolkit written in... | Experimental |