LLM Inference Engines (Transformers)

Optimized inference engines and serving systems for deploying and running large language models efficiently. Focuses on throughput, latency, memory optimization, and production deployment. Does NOT include training frameworks, fine-tuning methods, quantization techniques, or model architecture implementations.

153 LLM inference engine projects are tracked. 7 score above 70 (verified tier). The highest-rated is vllm-project/vllm at 100/100, with 73,007 stars and 7,953,905 monthly downloads. All of the top 10 are actively maintained.

Get the projects as JSON (the example below returns the top 20; adjust the limit parameter for more):

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=transformers&subcategory=llm-inference-engines&limit=20"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
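If you prefer Python to curl, a minimal stdlib-only sketch follows. It mirrors the query parameters of the curl call above; the shape of the JSON response (e.g. a top-level "projects" field) is an assumption, so inspect the actual payload before relying on it.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://pt-edge.onrender.com/api/v1/datasets/quality"

def build_url(domain: str, subcategory: str, limit: int = 20) -> str:
    """Assemble the query string exactly as in the curl example above."""
    params = urllib.parse.urlencode(
        {"domain": domain, "subcategory": subcategory, "limit": limit}
    )
    return f"{API_BASE}?{params}"

def fetch_projects(domain: str = "transformers",
                   subcategory: str = "llm-inference-engines",
                   limit: int = 20) -> dict:
    """Fetch and decode the JSON payload (no API key needed up to 100 req/day)."""
    with urllib.request.urlopen(build_url(domain, subcategory, limit)) as resp:
        return json.load(resp)

# Usage (network call, not run at import time):
#   data = fetch_projects(limit=153)
#   print(data)  # inspect the payload; the field layout is not documented here
```

With a free key (1,000 requests/day), you would presumably pass it as a header or query parameter; the authentication mechanism is not shown on this page, so check the API documentation.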

Ranked projects, listed as: rank. repo (score/100, tier): description.

1. vllm-project/vllm (100, Verified): A high-throughput and memory-efficient inference and serving engine for LLMs
2. sgl-project/sglang (100, Verified): SGLang is a high-performance serving framework for large language models and...
3. alibaba/MNN (93, Verified): MNN: A blazing-fast, lightweight inference engine battle-tested by Alibaba,...
4. xorbitsai/inference (89, Verified): Swap GPT for any LLM by changing a single line of code. Xinference lets you...
5. tensorzero/tensorzero (89, Verified): TensorZero is an open-source stack for industrial-grade LLM applications. It...
6. ARahim3/mlx-tune (75, Verified): Bringing the Unsloth experience to Mac users via Apple's MLX framework
7. gpustack/gpustack (71, Verified): Performance-optimized AI inference on your GPUs. Unlock superior throughput...
8. tenstorrent/tt-metal (69, Established): TT-NN operator library, and TT-Metalium low-level kernel programming model.
9. InternLM/lmdeploy (68, Established): LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
10. ModelTC/LightLLM (68, Established): LightLLM is a Python-based LLM (Large Language Model) inference and serving...
11. jd-opensource/xllm (66, Established): A high-performance inference engine for LLMs, optimized for diverse AI accelerators.
12. alibaba/rtp-llm (66, Established): RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.
13. bigscience-workshop/petals (63, Established): 🌸 Run LLMs at home, BitTorrent-style. Fine-tuning and inference up to 10x...
14. FastFlowLM/FastFlowLM (59, Established): Run LLMs on AMD Ryzen™ AI NPUs in minutes. Just like Ollama - but...
15. zhihu/ZhiLight (59, Established): A highly optimized LLM inference acceleration engine for Llama and its variants.
16. NexaAI/nexa-sdk (57, Established): Run frontier LLMs and VLMs with day-0 model support across GPU, NPU, and...
17. NVIDIA-NeMo/Automodel (56, Established): PyTorch Distributed-native training library for LLMs/VLMs with OOTB Hugging...
18. Tiiny-AI/PowerInfer (54, Established): High-speed large language model serving for local deployment
19. underneathall/pinferencia (53, Established): Python + Inference: a model deployment library in Python. Simplest model...
20. GeeeekExplorer/nano-vllm (53, Established): Nano vLLM
21. ai-decentralized/BloomBee (51, Established): Decentralized LLM fine-tuning and inference with offloading
22. higgsfield-ai/higgsfield (49, Emerging): Fault-tolerant, highly scalable GPU orchestration, and a machine learning...
23. intel/ipex-llm (49, Emerging): Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM,...
24. AI-Hypercomputer/JetStream (48, Emerging): JetStream is a throughput- and memory-optimized engine for LLM inference on...
25. toverainc/willow-inference-server (47, Emerging): Open-source, local, and self-hosted highly optimized language inference...
26. microsoft/sarathi-serve (47, Emerging): A low-latency & high-throughput serving engine for LLMs
27. alibaba/InferSim (46, Emerging): A lightweight LLM inference performance simulator
28. slwang-ustc/nano-vllm-v1 (46, Emerging): Nano vLLM with vLLM v1's request scheduling strategy and chunked prefill
29. livepeer/ai-runner (46, Emerging): Inference runtime for running different batch and real-time AI pipelines.
30. Deep-Spark/DeepSparkInference (45, Emerging): DeepSparkInference has selected 216 inference models of both small and large...
31. zhenye234/LLaSA_training (45, Emerging): LLaSA: Scaling train-time and inference-time compute for LLaMA-based speech synthesis
32. microsoft/vidur (45, Emerging): A large-scale simulation framework for LLM inference
33. inclusionAI/asystem-awex (44, Emerging): A high-performance RL training-inference weight synchronization framework,...
34. kennethleungty/Llama-2-Open-Source-LLM-CPU-Inference (44, Emerging): Running Llama 2 and other open-source LLMs on CPU locally for document Q&A
35. vitoplantamura/OnnxStream (44, Emerging): Lightweight inference library for ONNX files, written in C++. It can run...
36. jina-ai/rungpt (43, Emerging): An open-source, cloud-native serving framework for large multi-modal models (LMMs).
37. Troyanovsky/Local-LLM-Comparison-Colab-UI (43, Emerging): Compare the performance of different LLMs that can be deployed locally on...
38. PureBee/purebee (41, Emerging): A GPU defined in software. Runs Llama 3.2 1B at 3.6 tok/sec. Zero dependencies.
39. SearchSavior/OpenArc (41, Emerging): Inference engine for Intel devices. Serve LLMs, VLMs, Whisper, Kokoro-TTS,...
40. vectorch-ai/ScaleLLM (40, Emerging): A high-performance inference system for large language models, designed for...
41. bytedance/byteir (39, Emerging): A model compilation solution for various hardware
42. MegEngine/InferLLM (39, Emerging): A lightweight LLM inference framework
43. RWKV/rwkv.cpp (38, Emerging): INT4/INT5/INT8 and FP16 inference on CPU for the RWKV language model
44. zejia-lin/BulletServe (38, Emerging): Boosting GPU utilization for LLM serving via dynamic spatial-temporal...
45. AI-Hypercomputer/jetstream-pytorch (37, Emerging): PyTorch/XLA integration with JetStream (https://github.com/google/JetStream)...
46. andrewkchan/deepseek.cpp (37, Emerging): CPU inference for the DeepSeek family of large language models in C++
47. powerserve-project/PowerServe (37, Emerging): High-speed, easy-to-use LLM serving framework for local deployment
48. 1b5d/llm-api (37, Emerging): Run any large language model behind a unified API
49. interestingLSY/swiftLLM (37, Emerging): A tiny yet powerful LLM inference system tailored for research purposes....
50. SqueezeAILab/LLMCompiler (37, Emerging): [ICML 2024] LLMCompiler: An LLM Compiler for Parallel Function Calling
51. chenmozhijin/BSRoformer.cpp (36, Emerging): GGML-based C++ inference for BS Roformer/Mel-Band-Roformer vocal separation...
52. modelscope/dash-infer (36, Emerging): DashInfer is a native LLM inference engine aiming to deliver...
53. invergent-ai/surogate (36, Emerging): Insanely fast LLM pre-training and fine-tuning for modern NVIDIA GPUs....
54. jdaln/dgx-spark-inference-stack (36, Emerging): Serve the home! Inference stack for your Nvidia DGX Spark aka the Grace...
55. vivy-yi/awesome-llm-training-inference (35, Emerging): Curated list of LLM training and inference frameworks, tools, and resources....
56. toyaix/TritonLLM (35, Emerging): LLM inference via Triton (flexible & modular): focused on kernel...
57. Azure99/BlossomData (35, Emerging): A fluent, scalable, and easy-to-use LLM data processing framework.
58. jankais3r/LLaMA_MPS (35, Emerging): Run LLaMA (and Stanford Alpaca) inference on Apple Silicon GPUs.
59. TrevTron/indiedroid-nova-llm (34, Emerging): Running Llama 3.1 8B and other LLMs on the RK3588 NPU: benchmarks and setup guides
60. thruthseeker/LionLock_FDE_OSS (34, Emerging): Open-source fatigue detection engine for large language models with trust overlay
61. aniketmaurya/llm-inference (33, Emerging): Large language model (LLM) inference API and chatbot
62. hpcaitech/SwiftInfer (32, Emerging): Efficient AI inference & serving
63. MrYxJ/calculate-flops.pytorch (32, Emerging): calflops is designed to calculate FLOPs, MACs, and parameters in all...
64. nareshis21/Truelarge-RT (32, Emerging): Android inference engine running 20B+ parameter LLMs on 4GB-8GB RAM devices....
65. riccardomusmeci/mlx-llm (32, Emerging): Large language model (LLM) applications and tools running on Apple Silicon...
66. James-QiuHaoran/LLM-serving-with-proxy-models (32, Emerging): Efficient interactive LLM serving with proxy-model-based sequence length...
67. efeslab/Nanoflow (31, Emerging): A throughput-oriented high-performance serving framework for LLMs
68. argonne-lcf/LLM-Inference-Bench (31, Emerging): LLM-Inference-Bench
69. AmpereComputingAI/llama.cpp (31, Emerging): Ampere-optimized llama.cpp
70. CoderLSF/fast-llama (30, Emerging): Runs LLaMA at extremely high speed
71. andrewkchan/yalm (30, Emerging): Yet Another Language Model: LLM inference in C++/CUDA, no libraries except for I/O
72. tommasocerruti/detllm (30, Emerging): Deterministic-mode checks for LLM inference: measure run/batch variance,...
73. knagrecha/saturn (30, Emerging): Saturn accelerates the training of large-scale deep learning models with a...
74. zRzRzRzRzRzRzR/lm-fly (30, Emerging): Acceleration for large-model inference frameworks; makes LLMs fly
75. rbitr/llm.f90 (30, Emerging): LLM inference in Fortran
76. yingding/applyllm (30, Emerging): A Python package for applying LLMs with LangChain and Hugging Face on local...
77. ShinoharaHare/LLM-Training (30, Emerging): A distributed training framework for large language models powered by Lightning.
78. gunnarnordqvist/opencode-context-filter (29, Experimental): Transparent HTTP proxy that automatically filters repository context for...
79. gotzmann/booster (29, Experimental): Booster: an open accelerator for LLMs. Better inference and debugging...
80. AshishGautamX/K8s-LLM-Scheduler (29, Experimental): An intelligent Kubernetes scheduler powered by Meta's Llama-3.3-70B model...
81. psmarter/mini-infer (29, Experimental): A high-performance LLM inference engine with PagedAttention |...
82. moeru-ai/demodel (29, Experimental): 🚀🛸 Easily boost the speed of pulling your models and datasets from various...
83. m0dulo/InferSpore (29, Experimental): 🌱 A fully independent large language model (LLM) inference engine, built...
84. m-horky/sllm (29, Experimental): Tools using small large language models
85. lucasjinreal/Namo-R1 (28, Experimental): A real-time CPU VLM at 500M parameters. Surpassed Moondream2 and SmolVLM. Training from...
86. KarthikSriramGit/H.E.I.M.D.A.L.L (28, Experimental): H.E.I.M.D.A.L.L looks at fleet telemetry and gives you natural-language...
87. alibaba/easydist (28, Experimental): Automated parallelization system and infrastructure for multiple ecosystems
88. winstxnhdw/llm-api (28, Experimental): A fast CPU-based API for Qwen 2.5 using CTranslate2, hosted on Hugging Face Spaces.
89. jmaczan/tiny-vllm (27, Experimental): High-performance LLM inference engine, a younger sibling of vLLM
90. RahulSChand/gpu_poor (27, Experimental): Calculate token/s & GPU memory requirements for any LLM. Supports...
91. dengls24/LLM-para (26, Experimental): Analyze LLM inference: FLOPs, memory, roofline model. Supports GQA, MoE,...
92. BenChaliah/NVFP4-on-4090-vLLM (26, Experimental): AdaLLM is an NVFP4-first inference runtime for Ada Lovelace (RTX 4090) with...
93. ToddThomson/Mila (26, Experimental): The Achilles Mila deep neural network library provides a comprehensive API to...
94. HyperMink/inferenceable (25, Experimental): Scalable AI inference server for CPU and GPU with Node.js | Utilizes...
95. ybubnov/metalchat (25, Experimental): Pure C++23 Llama inference for Apple Silicon chips
96. kennethleungty/DeepSeek-R1-Ollama-Simple-Evals (25, Experimental): Run and evaluate DeepSeek-R1 distilled models locally with Ollama and...
97. harleyszhang/llm_counts (24, Experimental): LLM theoretical performance analysis tools supporting params, FLOPs, memory...
98. titanml/takeoff-community (24, Experimental): TitanML Takeoff Server is an optimization, compression, and deployment...
99. bpevangelista/vfastml (24, Experimental): Inference and training engine for LLMs, image2image, and other models
100. Relaxed-System-Lab/HexGen (24, Experimental): [ICML 2024] Serving LLMs on heterogeneous decentralized clusters.
101. KevinLee1110/dynamic-batching (24, Experimental): The official repo for the paper "Optimizing LLM Inference Throughput via...
102. mjglatzmaier/llm-boostrap (23, Experimental): Starter repo for running local LLM inference and lightweight benchmarking on...
103. HelpingAI/inferno (23, Experimental): Run Llama 3.3, DeepSeek-R1, Phi-4, Gemma 3, Mistral Small 3.1, and other...
104. quantumnic/ssd-llm (23, Experimental): Run 70B+ LLMs on Apple Silicon by using SSD as extended memory — intelligent...
105. llm-works/llm-infer (22, Experimental): LLM inference server with native, vLLM, and Ollama backends, including a...
106. VPanjeta/PyLLaMa-CPU (22, Experimental): Fast LLaMA inference on CPU using llama.cpp for Python
107. deepagency/llm-resource-planner (22, Experimental): A simple CLI tool to fetch Hugging Face model metadata and estimate required...
108. TeamADAPT/blitzkernels (22, Experimental): BlitzKernels — production WASM inference kernels for edge AI (embedding,...
109. onlychara553-debug/dgx-spark-inference-stack (22, Experimental): 🚀 Serve large language models efficiently at home with this Docker-based...
110. liam8421/faster-llm (22, Experimental): 🚀 Accelerate LLM training with Fast-LLM, an open-source library for...
111. MonitooDev/indiedroid-nova-llm (22, Experimental): 🚀 Benchmark local LLMs like Llama 3.1 on the Indiedroid Nova with RK3588...
112. changwoolee/BLAST (20, Experimental): [NeurIPS 2024] BLAST: Block-Level Adaptive Structured Matrix for Efficient...
113. modelize-ai/LLM-Inference-Deployment-Tutorial (20, Experimental): Tutorial for LLM developers about engine design, service deployment,...
114. rafaelmaza/llmfit-web (20, Experimental): Find the best open-source LLM for your GPU/RAM: fit, speed & quality...
115. AntonioVFranco/elamonica (20, Experimental): Production-ready test-time compute optimization framework for LLM inference....
116. CornelisKuijpers/SIP-interface (19, Experimental): Run 400B+ parameter AI models on consumer hardware with 12GB RAM
117. landry-some/LLM-streaming (19, Experimental): Efficient streaming inference for large language models (LLMs).
118. darxkies/cpu-slm (19, Experimental): A holiday project to better understand the inner workings of SLMs/LLMs.
119. johnbrodowski/AutoInferenceBenchmark (19, Experimental): AutoInferenceBenchmark is a Windows desktop application for evaluating and...
120. Artemarius/CuInfer (19, Experimental): From-scratch LLM inference engine in C++17/CUDA. Custom kernels, GGUF model...
121. ThalesMMS/sglang-config (19, Experimental): Configuration files and deployment scripts for serving Llama 3.2 3B and Qwen...
122. EmbeddedLLM/embeddedllm (18, Experimental): EmbeddedLLM: API server for embedded device deployment. Currently supports...
123. piotrmaciejbednarski/llm-inference-tampering (18, Experimental): Proof-of-concept for persistent manipulation of LLM outputs by modifying...
124. datvodinh/serve-llm (18, Experimental): Serve high-throughput, scalable LLMs using Ray and vLLM
125. tensorchord/inference-benchmark (17, Experimental): Benchmark for machine learning model online serving (LLM, embedding,...
126. GPUforLLM/llm-vram-calculator (17, Experimental): Accurate VRAM calculator for local LLMs (Llama 4, DeepSeek V3, Qwen 2.5)....
127. nitrictech/pycasts (17, Experimental): A text-to-podcast inference API
128. Meahg/exvllm (16, Experimental): 🚀 Enhance vLLM with exvllm to utilize MoE mixed inference, enabling...
129. ictnlp/SiLLM (16, Experimental): SiLLM is a Simultaneous Machine Translation (SiMT) framework. It utilizes a...
130. isshiki-dev/docker-model-runner (16, Experimental): Self-hosted Anthropic-API-compatible inference server with Claude Code...
131. arkodeepsen/helix (15, Experimental): Professional training stack for 100M-parameter language models optimized for...
132. AMD-AGI/gpt-fast (15, Experimental): GPT-Fast for multimodal models on AMD GPUs
133. virtualramblas/DFloat11_MPS (15, Experimental): DFloat11 for Apple Silicon.
134. rajatady/Inference-Stack (15, Experimental): Production-grade LLM inference API built from scratch. NestJS gateway +...
135. Scieries-Reunies-de-l-Est/llm (15, Experimental): LLM deployment API of the Service Commercial company.
136. 1337hero/rx7900xtx-llama-bench-rocm (15, Experimental): Benchmark script for llama.cpp & results for the AMD RX 7900 XTX
137. SunayHegde2006/Air.rs (15, Experimental): Air.rs: 70B+ inference on consumer GPUs; LLM inference in Rust
138. adamydwang/mobilellama (15, Experimental): A lightweight C++ LLaMA inference engine for mobile devices
139. rick97julho/do-i-have-the-vram (14, Experimental): 🔍 Estimate your VRAM needs for Hugging Face models in seconds without...
140. vishvaRam/Docker-vLLM-Server-Builder-Runpod (14, Experimental): Production-grade, OpenAI-compatible server using vLLM v0.17.0. Deploy LLMs,...
141. joeddav/illustrated-training-cluster (14, Experimental): [WIP] Interactive visualization of LLM training parallelism across GPU clusters
142. iNeil77/vllm-code-harness (14, Experimental): Run code inference-only benchmarks quickly using vLLM
143. X-rayLaser/DistributedLLM (13, Experimental): Run LLM inference by splitting models into parts and hosting each part on a...
144. rinoScremin/Open_Cluster_AI_Station_beta (12, Experimental): High-performance distributed matrix computation for AI workloads. Supports...
145. getflexai/flex_ai (12, Experimental): Simplifies fine-tuning and inference for 60+ open-source LLMs through a single API
146. eniompw/llama-cpp-gpu (12, Experimental): Load larger models by offloading model layers to both GPU and CPU
147. EvanZhuang/rocm_tips (11, Experimental): Tips for building and using DL packages for AMD ROCm
148. karun2328/llm_serving_benchmarks (11, Experimental): Benchmarking LLM inference serving with vLLM, analyzing latency, throughput,...
149. virtualramblas/FlexLLMGenMPS (11, Experimental): Running large language models on a single M1/M2 GPU for throughput-oriented...
150. ZeeetOne/llm-inference-deployment (11, Experimental): Practical example of deploying fine-tuned LLMs locally with FastAPI....
151. G-B-KEVIN-ARJUN/runtime-inference (11, Experimental): Faster AI: Accelerating Qwen 2.5 from 7 t/s to 82 t/s on a single RTX 4060...
152. KT313/assistant_base (10, Experimental): A custom framework for easy use of LLMs, VLMs, etc., supporting various modes...
153. di-osc/osc-llm (10, Experimental): A lightweight large-model inference engine