# LLM Inference Engines

High-performance inference frameworks and engines optimized for deploying and serving LLMs efficiently across various hardware accelerators and resource-constrained devices. Does NOT include LLM training frameworks, fine-tuning tools, or application-level chatbot/UI wrappers.

There are 35 LLM inference engine tools tracked. One scores above 70 (Verified tier). The highest-rated is kvcache-ai/Mooncake at 72/100 with 4,911 stars. Six of the top 10 are actively maintained.

Get all 35 projects as JSON:

```shell
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=llm-inference-engines&limit=20"
```

Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000 requests/day.
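The same query can be issued from Python with the standard library alone. This is a minimal sketch: the endpoint and query parameters mirror the curl example above, but the shape of the JSON payload (field names, nesting) is an assumption, not a documented schema.

```python
# Minimal client sketch for the quality dataset endpoint.
# The URL and query parameters come from the curl example above;
# the response structure is NOT documented here, so callers should
# inspect the decoded JSON before relying on specific field names.
import json
import urllib.parse
import urllib.request

API_BASE = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain: str, subcategory: str, limit: int = 20) -> str:
    """Build the dataset query URL with properly encoded parameters."""
    params = urllib.parse.urlencode(
        {"domain": domain, "subcategory": subcategory, "limit": limit}
    )
    return f"{API_BASE}?{params}"


def fetch_projects(domain: str, subcategory: str, limit: int = 20):
    """Fetch and decode the JSON payload (no API key: 100 requests/day)."""
    url = build_url(domain, subcategory, limit)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Request all 35 tracked projects in this subcategory.
    print(build_url("llm-tools", "llm-inference-engines", limit=35))
```

With a free key, the rate limit rises to 1,000 requests/day; how the key is passed (header vs. query parameter) is not specified on this page, so check the API docs before adding authentication.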

| # | Tool | Description | Score | Tier |
|---|------|-------------|-------|------|
| 1 | kvcache-ai/Mooncake | Mooncake is the serving platform for Kimi, a leading LLM service provided by... | 72 | Verified |
| 2 | vllm-project/vllm-ascend | Community-maintained hardware plugin for vLLM on Ascend | 69 | Established |
| 3 | SemiAnalysisAI/InferenceX | Open Source Continuous Inference Benchmarking Qwen3.5, DeepSeek, GPTOSS -... | 66 | Established |
| 4 | sophgo/tpu-mlir | Machine learning compiler based on MLIR for Sophgo TPU. | 64 | Established |
| 5 | uccl-project/uccl | UCCL is an efficient communication library for GPUs, covering collectives,... | 64 | Established |
| 6 | BBuf/how-to-optim-algorithm-in-cuda | How to optimize some algorithms in CUDA. | 58 | Established |
| 7 | RightNow-AI/picolm | Run a 1-billion-parameter LLM on a $10 board with 256MB RAM | 50 | Established |
| 8 | jinbooooom/ai-infra-hpc | HPC tutorials covering collective communication (MPI, NCCL), CUDA programming, SIMD vectorization, RDMA communication, and more | 47 | Emerging |
| 9 | zjhellofss/KuiperLLama | A hands-on project for building an LLM inference framework from scratch, supporting LLama2/3 and Qwen2.5; well suited to campus-recruiting and internship portfolios. | 42 | Emerging |
| 10 | RayFernando1337/LLM-Calc | Instantly calculate the maximum size of quantized language models that can... | 39 | Emerging |
| 11 | erans/selfhostllm | A web-based calculator for estimating GPU memory requirements and maximum... | 36 | Emerging |
| 12 | amirgholami/ai_and_memory_wall | AI and Memory Wall | 35 | Emerging |
| 13 | bd4sur/Nano | Electronic Parrot / Toy Language Model | 34 | Emerging |
| 14 | ChiefGyk3D/FrankenLLM | Stitched-together GPUs, but it lives! Run different LLM models optimally... | 32 | Emerging |
| 15 | FilipFan/PolyEngineInfer | Run LLM inference in an Android app with llama.cpp, ExecuTorch, LiteRT,... | 28 | Experimental |
| 16 | PrajwalNeeralagi/nano-vllm | 🚀 Implement fast offline inference with Nano-vLLM, a lightweight and... | 24 | Experimental |
| 17 | Alex188dot/GPU-VRAM-Calculator | A simple tool to find out GPU VRAM requirements for running LLMs | 24 | Experimental |
| 18 | refinefuture-ai/refft.cpp | A new approach of running LLM/LMs' inference/training on GPU/NPU backends... | 23 | Experimental |
| 19 | Jugurthakebaili1/vLLM-Kunlun | 🛠 Enhance vLLM performance on Kunlun XPU with this hardware plugin, offering... | 22 | Experimental |
| 20 | manishklach/SRMIC_X1 | Analytical simulator for SRMIC, a residency-first LLM inference accelerator... | 22 | Experimental |
| 21 | darekhta/marmot | High-performance LLM inference engine in C23 with CPU and Metal backends,... | 22 | Experimental |
| 22 | George614/gpu-mem-calculator | GPU Memory Calculator for LLM Training - Calculate GPU memory requirements... | 22 | Experimental |
| 23 | dwain-barnes/LLM-GGUF-Auto-Converter | Automated Jupyter notebook solution for batch converting Large Language... | 18 | Experimental |
| 24 | hofong428/Optimizing-GPU-Kernels | LLM Serving & Inference Optimization | 18 | Experimental |
| 25 | simar-rekhi/triton | LLM-assisted compiler pass generation with Triton & CUDA | 16 | Experimental |
| 26 | NEBUL-AI/HF-VRAM-Extension | VRAM calculator for Hugging Face models | 15 | Experimental |
| 27 | r3tr056/loc-ai-ly | Locaily - Making Large Language Model Inference Accessible on Consumer Hardware | 15 | Experimental |
| 28 | Pyrolignic-paydirt84/pse-vcipher-collapse | Accelerate LLM inference by collapsing attention paths with... | 14 | Experimental |
| 29 | soy-tuber/localllama-insights | Technical insights from r/LocalLLaMA: vLLM, FP8, NVFP4, Blackwell GPU... | 14 | Experimental |
| 30 | MetaxisResearch/parallax | Distributed inference across heterogeneous hardware. | 14 | Experimental |
| 31 | LessUp/hetero-paged-infer | PagedAttention + Continuous Batching Inference Engine Prototype (Rust):... | 14 | Experimental |
| 32 | jbenongftw/gpu-perf-engineering-resources | 🚀 Master GPU kernel programming and optimization for high-performance AI... | 14 | Experimental |
| 33 | jman4162/Sizing-AI-Training-by-Cost-per-Memory-Bandwidth | A practical model (with math + Python) to tell if you're compute-, memory-,... | 11 | Experimental |
| 34 | elibutters/CascadeInference | Cascade-based inference for LLMs | 11 | Experimental |
| 35 | Alexyskoutnev/TurboInference | Welcome to TurboInference, a high-performance inference toolkit written in... | 10 | Experimental |