LLM Inference Engines for Transformer Models
Optimized inference engines and serving systems for deploying and running large language models efficiently. Focuses on throughput, latency, memory optimization, and production deployment. Does NOT include training frameworks, fine-tuning methods, quantization techniques, or model architecture implementations.
There are 153 LLM inference engine projects tracked; 7 score above 70 (verified tier). The highest-rated is vllm-project/vllm at 100/100, with 73,007 stars and 7,953,905 monthly downloads. All 10 of the top 10 are actively maintained.
Get all 153 projects as JSON (the example below requests the first 20; raise `limit` to fetch them all):

```shell
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=transformers&subcategory=llm-inference-engines&limit=20"
```
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
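The endpoint can also be consumed from a script. Below is a minimal sketch using only the Python standard library; the response schema (either a bare JSON list of project records, or an envelope object with a `results` key) is an assumption, so inspect the actual payload before relying on these field names:

```python
import json
import urllib.request

# Query string mirrors the curl example above.
API_URL = (
    "https://pt-edge.onrender.com/api/v1/datasets/quality"
    "?domain=transformers&subcategory=llm-inference-engines&limit=20"
)

def extract_projects(payload):
    """Return project records from either a bare JSON list or an
    envelope object with a "results" key (the schema is an assumption)."""
    if isinstance(payload, list):
        return payload
    return payload.get("results", [])

def fetch_projects(url=API_URL):
    """Perform one unauthenticated request (counts against the
    100-requests/day free quota) and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return extract_projects(json.load(resp))
```

Calling `fetch_projects()` performs one live request; `extract_projects` is split out so the decoding logic can be tested offline.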
| # | Model | Description | Score | Tier |
|---|---|---|---|---|
| 1 | vllm-project/vllm | A high-throughput and memory-efficient inference and serving engine for LLMs | | Verified |
| 2 | sgl-project/sglang | SGLang is a high-performance serving framework for large language models and... | | Verified |
| 3 | alibaba/MNN | MNN: A blazing-fast, lightweight inference engine battle-tested by Alibaba,... | | Verified |
| 4 | xorbitsai/inference | Swap GPT for any LLM by changing a single line of code. Xinference lets you... | | Verified |
| 5 | tensorzero/tensorzero | TensorZero is an open-source stack for industrial-grade LLM applications. It... | | Verified |
| 6 | ARahim3/mlx-tune | Bringing the Unsloth experience to Mac users via Apple's MLX framework | | Verified |
| 7 | gpustack/gpustack | Performance-optimized AI inference on your GPUs. Unlock superior throughput... | | Verified |
| 8 | tenstorrent/tt-metal | :metal: TT-NN operator library, and TT-Metalium low-level kernel programming model. | | Established |
| 9 | InternLM/lmdeploy | LMDeploy is a toolkit for compressing, deploying, and serving LLMs. | | Established |
| 10 | ModelTC/LightLLM | LightLLM is a Python-based LLM (Large Language Model) inference and serving... | | Established |
| 11 | jd-opensource/xllm | A high-performance inference engine for LLMs, optimized for diverse AI accelerators. | | Established |
| 12 | alibaba/rtp-llm | RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications. | | Established |
| 13 | bigscience-workshop/petals | 🌸 Run LLMs at home, BitTorrent-style. Fine-tuning and inference up to 10x... | | Established |
| 14 | FastFlowLM/FastFlowLM | Run LLMs on AMD Ryzen™ AI NPUs in minutes. Just like Ollama - but... | | Established |
| 15 | zhihu/ZhiLight | A highly optimized LLM inference acceleration engine for Llama and its variants. | | Established |
| 16 | NexaAI/nexa-sdk | Run frontier LLMs and VLMs with day-0 model support across GPU, NPU, and... | | Established |
| 17 | NVIDIA-NeMo/Automodel | PyTorch Distributed native training library for LLMs/VLMs with OOTB Hugging... | | Established |
| 18 | Tiiny-AI/PowerInfer | High-speed Large Language Model serving for local deployment | | Established |
| 19 | underneathall/pinferencia | Python + Inference - Model deployment library in Python. Simplest model... | | Established |
| 20 | GeeeekExplorer/nano-vllm | Nano vLLM | | Established |
| 21 | ai-decentralized/BloomBee | Decentralized LLM fine-tuning and inference with offloading | | Established |
| 22 | higgsfield-ai/higgsfield | Fault-tolerant, highly scalable GPU orchestration, and a machine learning... | | Emerging |
| 23 | intel/ipex-llm | Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM,... | | Emerging |
| 24 | AI-Hypercomputer/JetStream | JetStream is a throughput- and memory-optimized engine for LLM inference on... | | Emerging |
| 25 | toverainc/willow-inference-server | Open-source, local, and self-hosted highly optimized language inference... | | Emerging |
| 26 | microsoft/sarathi-serve | A low-latency & high-throughput serving engine for LLMs | | Emerging |
| 27 | alibaba/InferSim | A lightweight LLM inference performance simulator | | Emerging |
| 28 | slwang-ustc/nano-vllm-v1 | Nano vLLM with vLLM v1's request scheduling strategy and chunked prefill | | Emerging |
| 29 | livepeer/ai-runner | Inference runtime for running different batch and real-time AI pipelines. | | Emerging |
| 30 | Deep-Spark/DeepSparkInference | DeepSparkInference has selected 216 inference models of both small and large... | | Emerging |
| 31 | zhenye234/LLaSA_training | LLaSA: Scaling Train-time and Inference-time Compute for LLaMA-based Speech Synthesis | | Emerging |
| 32 | microsoft/vidur | A large-scale simulation framework for LLM inference | | Emerging |
| 33 | inclusionAI/asystem-awex | A high-performance RL training-inference weight synchronization framework,... | | Emerging |
| 34 | kennethleungty/Llama-2-Open-Source-LLM-CPU-Inference | Running Llama 2 and other open-source LLMs on CPU inference locally for document Q&A | | Emerging |
| 35 | vitoplantamura/OnnxStream | Lightweight inference library for ONNX files, written in C++. It can run... | | Emerging |
| 36 | jina-ai/rungpt | An open-source, cloud-native serving framework for large multi-modal models (LMMs). | | Emerging |
| 37 | Troyanovsky/Local-LLM-Comparison-Colab-UI | Compare the performance of different LLMs that can be deployed locally on... | | Emerging |
| 38 | PureBee/purebee | A GPU defined in software. Runs Llama 3.2 1B at 3.6 tok/sec. Zero dependencies. | | Emerging |
| 39 | SearchSavior/OpenArc | Inference engine for Intel devices. Serve LLMs, VLMs, Whisper, Kokoro-TTS,... | | Emerging |
| 40 | vectorch-ai/ScaleLLM | A high-performance inference system for large language models, designed for... | | Emerging |
| 41 | bytedance/byteir | A model compilation solution for various hardware | | Emerging |
| 42 | MegEngine/InferLLM | A lightweight LLM model inference framework | | Emerging |
| 43 | RWKV/rwkv.cpp | INT4/INT5/INT8 and FP16 inference on CPU for the RWKV language model | | Emerging |
| 44 | zejia-lin/BulletServe | Boosting GPU utilization for LLM serving via dynamic spatial-temporal... | | Emerging |
| 45 | AI-Hypercomputer/jetstream-pytorch | PyTorch/XLA integration with JetStream (https://github.com/google/JetStream)... | | Emerging |
| 46 | andrewkchan/deepseek.cpp | CPU inference for the DeepSeek family of large language models in C++ | | Emerging |
| 47 | powerserve-project/PowerServe | High-speed and easy-to-use LLM serving framework for local deployment | | Emerging |
| 48 | 1b5d/llm-api | Run any large language model behind a unified API | | Emerging |
| 49 | interestingLSY/swiftLLM | A tiny yet powerful LLM inference system tailored for research purposes.... | | Emerging |
| 50 | SqueezeAILab/LLMCompiler | [ICML 2024] LLMCompiler: An LLM Compiler for Parallel Function Calling | | Emerging |
| 51 | chenmozhijin/BSRoformer.cpp | GGML-based C++ inference for BS Roformer/Mel-Band-Roformer vocal separation... | | Emerging |
| 52 | modelscope/dash-infer | DashInfer is a native LLM inference engine aiming to deliver... | | Emerging |
| 53 | invergent-ai/surogate | Insanely fast LLM pre-training and fine-tuning for modern NVIDIA GPUs.... | | Emerging |
| 54 | jdaln/dgx-spark-inference-stack | Serve the home! Inference stack for your Nvidia DGX Spark, aka the Grace... | | Emerging |
| 55 | vivy-yi/awesome-llm-training-inference | Curated list of LLM training and inference frameworks, tools, and resources.... | | Emerging |
| 56 | toyaix/TritonLLM | LLM inference via Triton (flexible & modular): focused on kernel... | | Emerging |
| 57 | Azure99/BlossomData | A fluent, scalable, and easy-to-use LLM data processing framework. | | Emerging |
| 58 | jankais3r/LLaMA_MPS | Run LLaMA (and Stanford Alpaca) inference on Apple Silicon GPUs. | | Emerging |
| 59 | TrevTron/indiedroid-nova-llm | Running Llama 3.1 8B and other LLMs on the RK3588 NPU - benchmarks and setup guides | | Emerging |
| 60 | thruthseeker/LionLock_FDE_OSS | Open-source fatigue detection engine for large language models with trust overlay | | Emerging |
| 61 | aniketmaurya/llm-inference | Large Language Model (LLM) inference API and chatbot | | Emerging |
| 62 | hpcaitech/SwiftInfer | Efficient AI inference & serving | | Emerging |
| 63 | MrYxJ/calculate-flops.pytorch | calflops is designed to calculate FLOPs, MACs, and parameters in all... | | Emerging |
| 64 | nareshis21/Truelarge-RT | Android inference engine running 20B+ parameter LLMs on 4GB-8GB RAM devices.... | | Emerging |
| 65 | riccardomusmeci/mlx-llm | Large language model (LLM) applications and tools running on Apple Silicon... | | Emerging |
| 66 | James-QiuHaoran/LLM-serving-with-proxy-models | Efficient interactive LLM serving with proxy-model-based sequence length... | | Emerging |
| 67 | efeslab/Nanoflow | A throughput-oriented, high-performance serving framework for LLMs | | Emerging |
| 68 | argonne-lcf/LLM-Inference-Bench | LLM-Inference-Bench | | Emerging |
| 69 | AmpereComputingAI/llama.cpp | Ampere-optimized llama.cpp | | Emerging |
| 70 | CoderLSF/fast-llama | Runs LLaMA at extremely high speed | | Emerging |
| 71 | andrewkchan/yalm | Yet Another Language Model: LLM inference in C++/CUDA, no libraries except for I/O | | Emerging |
| 72 | tommasocerruti/detllm | Deterministic-mode checks for LLM inference: measure run/batch variance,... | | Emerging |
| 73 | knagrecha/saturn | Saturn accelerates the training of large-scale deep learning models with a... | | Emerging |
| 74 | zRzRzRzRzRzRzR/lm-fly | LLM inference framework acceleration: make LLMs fly | | Emerging |
| 75 | rbitr/llm.f90 | LLM inference in Fortran | | Emerging |
| 76 | yingding/applyllm | A Python package for applying LLMs with LangChain and Hugging Face on local... | | Emerging |
| 77 | ShinoharaHare/LLM-Training | A distributed training framework for large language models powered by Lightning. | | Emerging |
| 78 | gunnarnordqvist/opencode-context-filter | Transparent HTTP proxy that automatically filters repository context for... | | Experimental |
| 79 | gotzmann/booster | Booster - open accelerator for LLM models. Better inference and debugging... | | Experimental |
| 80 | AshishGautamX/K8s-LLM-Scheduler | An intelligent Kubernetes scheduler powered by Meta's Llama-3.3-70B model... | | Experimental |
| 81 | psmarter/mini-infer | A high-performance LLM inference engine with PagedAttention \|... | | Experimental |
| 82 | moeru-ai/demodel | 🚀🛸 Easily boost the speed of pulling your models and datasets from various... | | Experimental |
| 83 | m0dulo/InferSpore | 🌱 A fully independent Large Language Model (LLM) inference engine, built... | | Experimental |
| 84 | m-horky/sllm | Tools using small large language models | | Experimental |
| 85 | lucasjinreal/Namo-R1 | A CPU real-time VLM in 500M. Surpassed Moondream2 and SmolVLM. Training from... | | Experimental |
| 86 | KarthikSriramGit/H.E.I.M.D.A.L.L | H.E.I.M.D.A.L.L looks at fleet telemetry and gives you natural-language... | | Experimental |
| 87 | alibaba/easydist | Automated parallelization system and infrastructure for multiple ecosystems | | Experimental |
| 88 | winstxnhdw/llm-api | A fast CPU-based API for Qwen 2.5 using CTranslate2, hosted on Hugging Face Spaces. | | Experimental |
| 89 | jmaczan/tiny-vllm | High-performance LLM inference engine, a younger sibling of vLLM | | Experimental |
| 90 | RahulSChand/gpu_poor | Calculate tokens/s & GPU memory requirements for any LLM. Supports... | | Experimental |
| 91 | dengls24/LLM-para | Analyze LLM inference: FLOPs, memory, Roofline model. Supports GQA, MoE,... | | Experimental |
| 92 | BenChaliah/NVFP4-on-4090-vLLM | AdaLLM is an NVFP4-first inference runtime for Ada Lovelace (RTX 4090) with... | | Experimental |
| 93 | ToddThomson/Mila | Achilles Mila Deep Neural Network library provides a comprehensive API to... | | Experimental |
| 94 | HyperMink/inferenceable | Scalable AI inference server for CPU and GPU with Node.js \| Utilizes... | | Experimental |
| 95 | ybubnov/metalchat | Pure C++23 Llama inference for Apple Silicon chips | | Experimental |
| 96 | kennethleungty/DeepSeek-R1-Ollama-Simple-Evals | Run and evaluate DeepSeek-R1 distilled models locally with Ollama and... | | Experimental |
| 97 | harleyszhang/llm_counts | LLM theoretical performance analysis tools supporting params, FLOPs, memory... | | Experimental |
| 98 | titanml/takeoff-community | TitanML Takeoff Server is an optimization, compression, and deployment... | | Experimental |
| 99 | bpevangelista/vfastml | Inference and training engine for LLMs, Image2Image, and other models | | Experimental |
| 100 | Relaxed-System-Lab/HexGen | [ICML 2024] Serving LLMs on heterogeneous decentralized clusters. | | Experimental |
| 101 | KevinLee1110/dynamic-batching | The official repo for the paper "Optimizing LLM Inference Throughput via... | | Experimental |
| 102 | mjglatzmaier/llm-boostrap | Starter repo for running local LLM inference and lightweight benchmarking on... | | Experimental |
| 103 | HelpingAI/inferno | Run Llama 3.3, DeepSeek-R1, Phi-4, Gemma 3, Mistral Small 3.1, and other... | | Experimental |
| 104 | quantumnic/ssd-llm | Run 70B+ LLMs on Apple Silicon by using the SSD as extended memory: intelligent... | | Experimental |
| 105 | llm-works/llm-infer | LLM inference server with native, vLLM, and Ollama backends, including a... | | Experimental |
| 106 | VPanjeta/PyLLaMa-CPU | Fast LLaMA inference on CPU using llama.cpp for Python | | Experimental |
| 107 | deepagency/llm-resource-planner | A simple CLI tool to fetch Hugging Face model metadata and estimate required... | | Experimental |
| 108 | TeamADAPT/blitzkernels | BlitzKernels: production WASM inference kernels for edge AI (embedding,... | | Experimental |
| 109 | onlychara553-debug/dgx-spark-inference-stack | 🚀 Serve large language models efficiently at home with this Docker-based... | | Experimental |
| 110 | liam8421/faster-llm | 🚀 Accelerate LLM training with Fast-LLM, an open-source library for... | | Experimental |
| 111 | MonitooDev/indiedroid-nova-llm | 🚀 Benchmark local LLMs like Llama 3.1 on the Indiedroid Nova with RK3588... | | Experimental |
| 112 | changwoolee/BLAST | [NeurIPS 2024] BLAST: Block-Level Adaptive Structured Matrix for Efficient... | | Experimental |
| 113 | modelize-ai/LLM-Inference-Deployment-Tutorial | Tutorial for LLM developers about engine design, service deployment,... | | Experimental |
| 114 | rafaelmaza/llmfit-web | Find the best open-source LLM for your GPU/RAM - fit, speed & quality... | | Experimental |
| 115 | AntonioVFranco/elamonica | Production-ready test-time compute optimization framework for LLM inference.... | | Experimental |
| 116 | CornelisKuijpers/SIP-interface | Run 400B+ parameter AI models on consumer hardware with 12GB RAM | | Experimental |
| 117 | landry-some/LLM-streaming | Efficient streaming inference for large language models (LLMs). | | Experimental |
| 118 | darxkies/cpu-slm | A holiday project to better understand the inner workings of SLMs/LLMs. | | Experimental |
| 119 | johnbrodowski/AutoInferenceBenchmark | AutoInferenceBenchmark is a Windows desktop application for evaluating and... | | Experimental |
| 120 | Artemarius/CuInfer | From-scratch LLM inference engine in C++17/CUDA. Custom kernels, GGUF model... | | Experimental |
| 121 | ThalesMMS/sglang-config | Configuration files and deployment scripts for serving Llama 3.2 3B and Qwen... | | Experimental |
| 122 | EmbeddedLLM/embeddedllm | EmbeddedLLM: API server for embedded device deployment. Currently supports... | | Experimental |
| 123 | piotrmaciejbednarski/llm-inference-tampering | Proof-of-concept for persistent manipulation of LLM outputs by modifying... | | Experimental |
| 124 | datvodinh/serve-llm | Serve high-throughput, scalable LLMs using Ray and vLLM | | Experimental |
| 125 | tensorchord/inference-benchmark | Benchmark for machine learning model online serving (LLM, embedding,... | | Experimental |
| 126 | GPUforLLM/llm-vram-calculator | Accurate VRAM calculator for local LLMs (Llama 4, DeepSeek V3, Qwen 2.5).... | | Experimental |
| 127 | nitrictech/pycasts | A text-to-podcast inference API | | Experimental |
| 128 | Meahg/exvllm | 🚀 Enhance vLLM with exvllm to utilize MoE mixed inference, enabling... | | Experimental |
| 129 | ictnlp/SiLLM | SiLLM is a Simultaneous Machine Translation (SiMT) framework. It utilizes a... | | Experimental |
| 130 | isshiki-dev/docker-model-runner | Self-hosted Anthropic-API-compatible inference server with Claude Code... | | Experimental |
| 131 | arkodeepsen/helix | Professional training stack for 100M-parameter language models optimized for... | | Experimental |
| 132 | AMD-AGI/gpt-fast | GPT-Fast for multimodal models on AMD GPUs | | Experimental |
| 133 | virtualramblas/DFloat11_MPS | DFloat11 for Apple Silicon. | | Experimental |
| 134 | rajatady/Inference-Stack | Production-grade LLM inference API built from scratch. NestJS gateway +... | | Experimental |
| 135 | Scieries-Reunies-de-l-Est/llm | LLM deployment API of the Service Commercial company. | | Experimental |
| 136 | 1337hero/rx7900xtx-llama-bench-rocm | Benchmark script for llama.cpp & results for the AMD RX 7900 XTX | | Experimental |
| 137 | SunayHegde2006/Air.rs | Air.rs: 70B+ inference on a consumer GPU; LLM inference in Rust | | Experimental |
| 138 | adamydwang/mobilellama | A lightweight C++ LLaMA inference engine for mobile devices | | Experimental |
| 139 | rick97julho/do-i-have-the-vram | 🔍 Estimate your VRAM needs for Hugging Face models in seconds without... | | Experimental |
| 140 | vishvaRam/Docker-vLLM-Server-Builder-Runpod | Production-grade, OpenAI-compatible server using vLLM v0.17.0. Deploy LLMs,... | | Experimental |
| 141 | joeddav/illustrated-training-cluster | [WIP] Interactive visualization of LLM training parallelism across GPU clusters | | Experimental |
| 142 | iNeil77/vllm-code-harness | Run code-inference-only benchmarks quickly using vLLM | | Experimental |
| 143 | X-rayLaser/DistributedLLM | Run LLM inference by splitting models into parts and hosting each part on a... | | Experimental |
| 144 | rinoScremin/Open_Cluster_AI_Station_beta | High-performance distributed matrix computation for AI workloads. Supports... | | Experimental |
| 145 | getflexai/flex_ai | Simplifies fine-tuning and inference for 60+ open-source LLMs through a single API | | Experimental |
| 146 | eniompw/llama-cpp-gpu | Load larger models by offloading model layers to both GPU and CPU | | Experimental |
| 147 | EvanZhuang/rocm_tips | Tips for building and using DL packages for AMD ROCm | | Experimental |
| 148 | karun2328/llm_serving_benchmarks | Benchmarking LLM inference serving with vLLM, analyzing latency, throughput,... | | Experimental |
| 149 | virtualramblas/FlexLLMGenMPS | Running large language models on a single M1/M2 GPU for throughput-oriented... | | Experimental |
| 150 | ZeeetOne/llm-inference-deployment | Practical example of deploying fine-tuned LLMs locally with FastAPI.... | | Experimental |
| 151 | G-B-KEVIN-ARJUN/runtime-inference | "Faster AI: Accelerating Qwen 2.5 from 7 t/s to 82 t/s on a single RTX 4060... | | Experimental |
| 152 | KT313/assistant_base | A custom framework for easy use of LLMs, VLMs, etc., supporting various modes... | | Experimental |
| 153 | di-osc/osc-llm | A lightweight large language model inference engine | | Experimental |