LLM Inference Serving Tools
Tools and frameworks for deploying, serving, and scaling LLM inference endpoints in production environments. Includes optimization techniques (quantization, batching, caching), serving platforms (vLLM, Ray Serve, BentoML), and infrastructure solutions. Does NOT include client SDKs, application frameworks, or fine-tuning tools.
We track 72 LLM inference serving tools. One scores above 70 (the Verified tier). The highest-rated is thu-pacman/chitu at 85/100, with 3,418 stars and 13 monthly downloads. One of the top 10 is actively maintained.
Get all 72 projects as JSON
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=llm-inference-serving&limit=72"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
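The same query can be issued from Python with only the standard library. This is a minimal sketch: the URL and query parameters come from the curl command above, but the shape of the JSON payload (a top-level `projects` list) is an assumption, not a documented schema.

```python
import json
import urllib.parse
import urllib.request

BASE = "https://pt-edge.onrender.com/api/v1/datasets/quality"

def build_url(domain: str, subcategory: str, limit: int) -> str:
    """Assemble the dataset query URL from its filter parameters."""
    query = urllib.parse.urlencode(
        {"domain": domain, "subcategory": subcategory, "limit": limit}
    )
    return f"{BASE}?{query}"

def fetch_projects(url: str):
    """Fetch and decode the JSON payload.

    Assumes the response is either a list of records or a dict with a
    'projects' key -- adjust to the actual schema once inspected.
    """
    with urllib.request.urlopen(url, timeout=10) as resp:
        payload = json.load(resp)
    if isinstance(payload, dict):
        return payload.get("projects", payload)
    return payload

# Build the request for the full dataset (network call deferred to the caller).
url = build_url("llm-tools", "llm-inference-serving", 72)
```

Calling `fetch_projects(url)` then returns the decoded records for local filtering or analysis.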
| # | Tool | Description | Score | Tier |
|---|------|-------------|-------|------|
| 1 | thu-pacman/chitu | High-performance inference framework for large language models, focusing on... | 85 | Verified |
| 2 | NotPunchnox/rkllama | Ollama alternative for Rockchip NPU: An efficient solution for running AI... | | Established |
| 3 | sophgo/LLM-TPU | Run generative AI models in sophgo BM1684X/BM1688 | | Established |
| 4 | Deep-Spark/DeepSparkHub | DeepSparkHub selects hundreds of application algorithms and models, covering... | | Emerging |
| 5 | HuaizhengZhang/AI-Infra-from-Zero-to-Hero | 🚀 Awesome System for Machine Learning ⚡️ AI System Papers and Industry... | | Emerging |
| 6 | eth-sri/lmql | A language for constraint-guided and efficient LLM programming. | | Emerging |
| 7 | bentoml/llm-inference-handbook | Everything you need to know about LLM inference | | Emerging |
| 8 | tomdyson/microllama | The smallest possible LLM API | | Emerging |
| 9 | howard-hou/VisualRWKV | VisualRWKV is the visual-enhanced version of the RWKV language model,... | | Emerging |
| 10 | ucbepic/BARGAIN | Low-Cost LLM-Powered Data Processing with Theoretical Guarantees | | Emerging |
| 11 | liguodongiot/llm-resource | A curated collection of high-quality full-stack LLM resources | | Emerging |
| 12 | 0-mostafa-rezaee-0/Batch_LLM_Inference_with_Ray_Data_LLM | Batch LLM Inference with Ray Data LLM: From Simple to Advanced | | Emerging |
| 13 | vicharak-in/Axon-NPU-Guide | This repository contains a guide on setting up toolkits to use the NPU present... | | Emerging |
| 14 | aws-samples/easy-model-deployer | Deploy open-source LLMs on AWS in minutes, with OpenAI-compatible APIs and... | | Emerging |
| 15 | FareedKhan-dev/llm-scale-deploy-guide | An end-to-end pipeline to optimize and host an LLM for 100K parallel queries | | Emerging |
| 16 | Seeed-Projects/reComputer-RK-LLM | This repository utilizes Docker to package large language models and... | | Emerging |
| 17 | CHKDSKLabs/l-bom | L-BOM is a small Python CLI that inspects local LLM model artifacts such as... | | Emerging |
| 18 | manuelescobar-dev/LLM-Tools | Open-source calculator for LLM system requirements. | | Emerging |
| 19 | alibaba/ServeGen | A framework for generating realistic LLM serving workloads | | Emerging |
| 20 | kungfuai/CVlization | Practical workflows for training and inference on AI models | | Emerging |
| 21 | Pelochus/ezrknpu | Easy installation and usage of Rockchip's NPUs found in RK3588 and similar SoCs | | Emerging |
| 22 | jmaczan/torch-webgpu | PyTorch compiler and WebGPU runtime | | Emerging |
| 23 | wangcx18/llm-vscode-inference-server | An endpoint server for efficiently serving quantized open-source LLMs for code. | | Emerging |
| 24 | av1d/rk3588_npu_llm_server | Allows HTTP access to an LLM running on the RK3588 NPU. Returns a JSON response. | | Emerging |
| 25 | av1d/NPU-Chat | Web chat front end for rk3588_npu_llm_server / RK3588 LLM chat interface | | Emerging |
| 26 | AlexKaravaev/world-creator | LLM-based CLI utility for creating simulation worlds. | | Emerging |
| 27 | tpietruszka/rate_limited | Efficient parallel utilization of slow, rate-limited APIs, like those of... | | Emerging |
| 28 | thekevinscott/vicuna-7b | Vicuna 7B is a large language model that runs in the browser. Exposes... | | Emerging |
| 29 | aws-samples/amazon-sagemaker-llama2-response-streaming-recipes | Amazon SageMaker Llama 2 Inference via Response Streaming | | Experimental |
| 30 | serialscriptr/Orange-PI-5-Pro-MLC-LLM | Guide I wrote mostly for myself on how to run mlc-llm on the Orange Pi 5 Pro | | Experimental |
| 31 | Zerohertz/PyCon_KR_2025_Tutorial_vLLM | 🐍 PyCon Korea 2025 Tutorial: A deep dive into vLLM's OpenAI-Compatible Server 🐍 | | Experimental |
| 32 | SRSWTI/axis | AI eXplainable Inference & Search. Open-sourcing on-premise, ultra-fast... | | Experimental |
| 33 | wudingjian/rkllm_chat | Deploy LLM models to the Rockchip RK3588 chip and run inference on the dev board's NPU | | Experimental |
| 34 | plushpluto/kllm | Welcome to KLLM, an advanced project focused on core kernel AI development,... | | Experimental |
| 35 | tmcarmichael/fabricai-inference-server | A hackable, modular, containerized inference server for deploying large... | | Experimental |
| 36 | llmcloud24/de.KCD-Summer-School-2024 | Learn how to deploy your own LLM in the de.NBI cloud via a step-by-step... | | Experimental |
| 37 | Leon6225/InternVL3.5-4B-NPU | 🌌 Advance multimodal AI with InternVL3.5-4B for the RK3588 NPU, enhancing vision... | | Experimental |
| 38 | parawaveio/parawave | One decorator turns any function into a durable parallel runner. | | Experimental |
| 39 | selimsandal/OneShotNPU | An NPU designed using an LLM with a single prompt | | Experimental |
| 40 | cdepillabout/mkAIDerivation | Generate a Nix derivation on the fly using an LLM | | Experimental |
| 41 | zia1138/rayevolve | Experimental project for LLM-guided algorithm design and optimization built on Ray | | Experimental |
| 42 | toopac01/InternVL3.5-8B-NPU | 🌌 Explore InternVL3.5-8B NPU for advanced multimodal capabilities on RK3588,... | | Experimental |
| 43 | Joao1PNM/awesome-llm-training-inference | Explore frameworks, tools, and resources for efficient large language model... | | Experimental |
| 44 | christophe0606/MLHelium | TinyLlama on Cortex-M55 using CMSIS-DSP and Helium vector instructions | | Experimental |
| 45 | Notnaton/microllm | My own implementation to run inference on local LLM models | | Experimental |
| 46 | godaai/llm-inference | Resources for Large Language Model Inference | | Experimental |
| 47 | yy29/aws-ec2-tips-llm-chat-ai | Tips for setting up an AI & Machine Learning R&D environment and LLM training &... | | Experimental |
| 48 | ravijo/pi-llm | Run large language models locally on a Raspberry Pi Zero 2W (512 MB RAM)... | | Experimental |
| 49 | Zerohertz/Instruct_KR_2025_Summer_Meetup_vLLM | 🎹 Instruct.KR 2025 Summer Meetup: Taking open-source LLMs to production with vLLM 🎹 | | Experimental |
| 50 | imetallica/nano-ai | Toolkit to train and build small LLMs in Elixir | | Experimental |
| 51 | sajidkhan2067/LLMOnAWS | Deploy a smaller LLM (Phi-2) on AWS Lambda: a cost-effective language model | | Experimental |
| 52 | CuzImSlymi/Apertis-LLM | Apertis LLM. Clean. Fast. Built Different. Custom LLM architecture designed... | | Experimental |
| 53 | jaslatendresse/llm-demo | This repository demonstrates how to do inference using llama.cpp on a... | | Experimental |
| 54 | ArslanKAS/Serverless-LLM-Amazon-Bedrock | Learn how to deploy a large language model-based application into... | | Experimental |
| 55 | romitjain/awesome-llm-systems | This repository aims to consolidate resources for learning about systems for LLMs | | Experimental |
| 56 | daslearning-org/OnLLM | OnLLM is a platform to run LLM or SLM models using OnnxRuntime directly on... | | Experimental |
| 57 | gfhe/LLM | Exploring private LLM training and deployment | | Experimental |
| 58 | aratan/LLM-CLI | LLM aratan/qwen3.5-uncensored:9b | | Experimental |
| 59 | oriolrius/sagemaker-llm-endpoint | Deploy HuggingFace LLMs on AWS SageMaker with vLLM, OpenAI-compatible API... | | Experimental |
| 60 | playaswd/rwkv-explainer | RWKV Explained Visually: Learn How LLM RWKV Models Work with Interactive... | | Experimental |
| 61 | ray-project/anyscale-berkeley-ai-hackathon | Ray and Anyscale for the UC Berkeley AI Hackathon! | | Experimental |
| 62 | ray-project/ray-serve-arize-observe | Building Real-Time Inference Pipelines with Ray Serve | | Experimental |
| 63 | CosmonautCode/Tiny-Local-LLM-System | A lightweight, self-contained Python project for running a local large... | | Experimental |
| 64 | mddunlap924/LLM-Inference-Serving | This repository demonstrates LLM execution on CPUs using packages like... | | Experimental |
| 65 | Qually5/distributed-training-ops | A collection of scripts and configurations for managing distributed training... | | Experimental |
| 66 | Rustem/ddl-playbook | Distributed Deep Learning Playbook | | Experimental |
| 67 | yutingshih/eai2024-final | Enhancing User Privacy by Local Deployment of LLMs, final project of EAI 2024 Fall | | Experimental |
| 68 | gbaptista/nano-apps | Tiny applications that can be embedded in Nano Bots, small, AI-powered robots... | | Experimental |
| 69 | look4pritam/InferenceServer-LargeLanguageModels | Large Language Models Inference Server | | Experimental |
| 70 | cjmcv/ai-infra-notes | Reading notes on the open-source code of AI infrastructure (sglang, llm,... | | Experimental |
| 71 | ParthaPRay/Readability_Ollama_LLM | This repo shows readability analysis of responses from... | | Experimental |
| 72 | ParthaPRay/python_rust_ollama_analysis | This repo shows how Ollama runs localized LLMs on a Raspberry Pi 4B... | | Experimental |
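Once the dataset JSON is downloaded, tier breakdowns like the one implicit in the table can be computed locally. A minimal sketch, assuming each record carries `name` and `tier` fields (the actual API schema is not documented here, so the sample records below are illustrative stand-ins):

```python
from collections import Counter

# Hypothetical records mirroring a few rows of the table above;
# the field names "name" and "tier" are assumptions about the payload.
records = [
    {"name": "thu-pacman/chitu", "tier": "Verified"},
    {"name": "NotPunchnox/rkllama", "tier": "Established"},
    {"name": "sophgo/LLM-TPU", "tier": "Established"},
    {"name": "eth-sri/lmql", "tier": "Emerging"},
]

def tier_breakdown(rows):
    """Count how many tools fall into each quality tier."""
    return Counter(row["tier"] for row in rows)

counts = tier_breakdown(records)
# counts: Verified=1, Established=2, Emerging=1 for this sample
```

Running the same function over the full 72-record payload reproduces the Verified / Established / Emerging / Experimental split shown in the Tier column.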