LLM Inference Serving Tools
Tools and frameworks for deploying, serving, and scaling LLM inference endpoints in production environments. Includes optimization techniques (quantization, batching, caching), serving platforms (vLLM, Ray Serve, BentoML), and infrastructure solutions. Does NOT include client SDKs, application frameworks, or fine-tuning tools.
We track 72 LLM inference serving tools. One scores above 70 (the Verified tier). The highest-rated is thu-pacman/chitu at 85/100, with 3,418 stars and 13 monthly downloads. One of the top 10 is actively maintained.
Get all 72 projects as JSON
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=llm-inference-serving&limit=72"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
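The same query can be issued from Python with only the standard library. This is a minimal sketch: the URL and query parameters come from the curl command above, but the shape of the JSON payload (a top-level `projects` list) is an assumption, not a documented schema.

```python
import json
import urllib.parse
import urllib.request

BASE = "https://pt-edge.onrender.com/api/v1/datasets/quality"

def build_url(domain: str, subcategory: str, limit: int) -> str:
    """Assemble the dataset query URL from its filter parameters."""
    query = urllib.parse.urlencode(
        {"domain": domain, "subcategory": subcategory, "limit": limit}
    )
    return f"{BASE}?{query}"

def fetch_projects(url: str):
    """Fetch and decode the JSON payload.

    Assumes the response is either a list of records or a dict with a
    'projects' key -- adjust to the actual schema once inspected.
    """
    with urllib.request.urlopen(url, timeout=10) as resp:
        payload = json.load(resp)
    if isinstance(payload, dict):
        return payload.get("projects", payload)
    return payload

# Build the request for the full dataset (network call deferred to the caller).
url = build_url("llm-tools", "llm-inference-serving", 72)
```

Calling `fetch_projects(url)` then returns the decoded records for local filtering or analysis.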
| # | Tool | Description | Score | Tier |
|---|------|-------------|-------|------|
| 1 | thu-pacman/chitu | High-performance inference framework for large language models, focusing on... | 85 | Verified |
| 2 | NotPunchnox/rkllama | Ollama alternative for Rockchip NPU: An efficient solution for running AI... | | Established |
| 3 | sophgo/LLM-TPU | Run generative AI models in sophgo BM1684X/BM1688 | | Established |
| 4 | Deep-Spark/DeepSparkHub | DeepSparkHub selects hundreds of application algorithms and models, covering... | | Emerging |
| 5 | HuaizhengZhang/AI-Infra-from-Zero-to-Hero | 🚀 Awesome System for Machine Learning ⚡️ AI System Papers and Industry... | | Emerging |
| 6 | eth-sri/lmql | A language for constraint-guided and efficient LLM programming. | | Emerging |
| 7 | bentoml/llm-inference-handbook | Everything you need to know about LLM inference | | Emerging |
| 8 | tomdyson/microllama | The smallest possible LLM API | | Emerging |
| 9 | howard-hou/VisualRWKV | VisualRWKV is the visual-enhanced version of the RWKV language model,... | | Emerging |
| 10 | ucbepic/BARGAIN | Low-Cost LLM-Powered Data Processing with Theoretical Guarantees | | Emerging |
| 11 | liguodongiot/llm-resource | A curated collection of high-quality full-stack LLM resources | | Emerging |
| 12 | 0-mostafa-rezaee-0/Batch_LLM_Inference_with_Ray_Data_LLM | Batch LLM Inference with Ray Data LLM: From Simple to Advanced | | Emerging |
| 13 | vicharak-in/Axon-NPU-Guide | This repository contains a guide on setting up toolkits to use the NPU present... | | Emerging |
| 14 | aws-samples/easy-model-deployer | Deploy open-source LLMs on AWS in minutes, with OpenAI-compatible APIs and... | | Emerging |
| 15 | FareedKhan-dev/llm-scale-deploy-guide | An end-to-end pipeline to optimize and host an LLM for 100K parallel queries | | Emerging |
| 16 | Seeed-Projects/reComputer-RK-LLM | This repository utilizes Docker to package large language models and... | | Emerging |
| 17 | CHKDSKLabs/l-bom | L-BOM is a small Python CLI that inspects local LLM model artifacts such as... | | Emerging |
| 18 | manuelescobar-dev/LLM-Tools | Open-source calculator for LLM system requirements. | | Emerging |
| 19 | alibaba/ServeGen | A framework for generating realistic LLM serving workloads | | Emerging |
| 20 | kungfuai/CVlization | Practical workflows for training and inference on AI models | | Emerging |
| 21 | Pelochus/ezrknpu | Easy installation and usage of Rockchip's NPUs found in RK3588 and similar SoCs | | Emerging |
| 22 | jmaczan/torch-webgpu | PyTorch compiler and WebGPU runtime | | Emerging |
| 23 | wangcx18/llm-vscode-inference-server | An endpoint server for efficiently serving quantized open-source LLMs for code. | | Emerging |
| 24 | av1d/rk3588_npu_llm_server | Allows HTTP access to an LLM running on the RK3588 NPU. Returns a JSON response. | | Emerging |
| 25 | av1d/NPU-Chat | Web chat front end for rk3588_npu_llm_server / RK3588 LLM chat interface | | Emerging |
| 26 | AlexKaravaev/world-creator | LLM-based CLI utility for creating simulation worlds. | | Emerging |
| 27 | tpietruszka/rate_limited | Efficient parallel utilization of slow, rate-limited APIs, like those of... | | Emerging |
| 28 | thekevinscott/vicuna-7b | Vicuna 7B is a large language model that runs in the browser. Exposes... | | Emerging |
| 29 | aws-samples/amazon-sagemaker-llama2-response-streaming-recipes | Amazon SageMaker Llama 2 Inference via Response Streaming | | Experimental |
| 30 | serialscriptr/Orange-PI-5-Pro-MLC-LLM | Guide I wrote mostly for myself on how to run mlc-llm on the Orange Pi 5 Pro | | Experimental |
| 31 | Zerohertz/PyCon_KR_2025_Tutorial_vLLM | 🐍 PyCon Korea 2025 Tutorial: A deep dive into vLLM's OpenAI-Compatible Server 🐍 | | Experimental |
| 32 | SRSWTI/axis | AI eXplainable Inference & Search. Open-sourcing on-premise, ultra-fast... | | Experimental |
| 33 | wudingjian/rkllm_chat | Deploy LLM models to the Rockchip RK3588 chip and run inference on the dev board's NPU | | Experimental |
| 34 | plushpluto/kllm | Welcome to KLLM, an advanced project focused on core kernel AI development,... | | Experimental |
| 35 | tmcarmichael/fabricai-inference-server | A hackable, modular, containerized inference server for deploying large... | | Experimental |
| 36 | llmcloud24/de.KCD-Summer-School-2024 | Learn how to deploy your own LLM in the de.NBI cloud via a step-by-step... | | Experimental |
| 37 | Leon6225/InternVL3.5-4B-NPU | 🌌 Advance multimodal AI with InternVL3.5-4B for the RK3588 NPU, enhancing vision... | | Experimental |
| 38 | parawaveio/parawave | One decorator turns any function into a durable parallel runner. | | Experimental |
| 39 | selimsandal/OneShotNPU | An NPU designed using an LLM with a single prompt | | Experimental |
| 40 | cdepillabout/mkAIDerivation | Generate a Nix derivation on the fly using an LLM | | Experimental |
| 41 | zia1138/rayevolve | Experimental project for LLM-guided algorithm design and optimization built on Ray | | Experimental |
| 42 | toopac01/InternVL3.5-8B-NPU | 🌌 Explore InternVL3.5-8B NPU for advanced multimodal capabilities on RK3588,... | | Experimental |
| 43 | Joao1PNM/awesome-llm-training-inference | Explore frameworks, tools, and resources for efficient large language model... | | Experimental |
| 44 | christophe0606/MLHelium | TinyLlama on Cortex-M55 using CMSIS-DSP and Helium vector instructions | | Experimental |
| 45 | Notnaton/microllm | My own implementation to run inference on local LLM models | | Experimental |
| 46 | godaai/llm-inference | Resources for Large Language Model Inference | | Experimental |
| 47 | yy29/aws-ec2-tips-llm-chat-ai | Tips for setting up an AI & Machine Learning R&D environment and LLM training &... | | Experimental |
| 48 | ravijo/pi-llm | Run large language models locally on a Raspberry Pi Zero 2W (512 MB RAM)... | | Experimental |
| 49 | Zerohertz/Instruct_KR_2025_Summer_Meetup_vLLM | 🎹 Instruct.KR 2025 Summer Meetup: Taking open-source LLMs to production with vLLM 🎹 | | Experimental |
| 50 | imetallica/nano-ai | Toolkit to train and build small LLMs in Elixir | | Experimental |
| 51 | sajidkhan2067/LLMOnAWS | Deploy a smaller LLM (Phi-2) on AWS Lambda: a cost-effective language model | | Experimental |
| 52 | CuzImSlymi/Apertis-LLM | Apertis LLM. Clean. Fast. Built Different. Custom LLM architecture designed... | | Experimental |
| 53 | jaslatendresse/llm-demo | This repository demonstrates how to do inference using llama.cpp on a... | | Experimental |
| 54 | ArslanKAS/Serverless-LLM-Amazon-Bedrock | Learn how to deploy a large language model-based application into... | | Experimental |
| 55 | romitjain/awesome-llm-systems | This repository aims to consolidate resources for learning about systems for LLMs | | Experimental |
| 56 | daslearning-org/OnLLM | OnLLM is a platform to run LLM or SLM models using OnnxRuntime directly on... | | Experimental |
| 57 | gfhe/LLM | Exploring private LLM training and deployment | | Experimental |
| 58 | aratan/LLM-CLI | LLM aratan/qwen3.5-uncensored:9b | | Experimental |
| 59 | oriolrius/sagemaker-llm-endpoint | Deploy HuggingFace LLMs on AWS SageMaker with vLLM, OpenAI-compatible API... | | Experimental |
| 60 | playaswd/rwkv-explainer | RWKV Explained Visually: Learn How LLM RWKV Models Work with Interactive... | | Experimental |
| 61 | ray-project/anyscale-berkeley-ai-hackathon | Ray and Anyscale for the UC Berkeley AI Hackathon! | | Experimental |
| 62 | ray-project/ray-serve-arize-observe | Building Real-Time Inference Pipelines with Ray Serve | | Experimental |
| 63 | CosmonautCode/Tiny-Local-LLM-System | A lightweight, self-contained Python project for running a local large... | | Experimental |
| 64 | mddunlap924/LLM-Inference-Serving | This repository demonstrates LLM execution on CPUs using packages like... | | Experimental |
| 65 | Qually5/distributed-training-ops | A collection of scripts and configurations for managing distributed training... | | Experimental |
| 66 | Rustem/ddl-playbook | Distributed Deep Learning Playbook | | Experimental |
| 67 | yutingshih/eai2024-final | Enhancing User Privacy by Local Deployment of LLMs, final project of EAI 2024 Fall | | Experimental |
| 68 | gbaptista/nano-apps | Tiny applications that can be embedded in Nano Bots, small, AI-powered robots... | | Experimental |
| 69 | look4pritam/InferenceServer-LargeLanguageModels | Large Language Models Inference Server | | Experimental |
| 70 | cjmcv/ai-infra-notes | Reading notes on the open-source code of AI infrastructure (sglang, llm,... | | Experimental |
| 71 | ParthaPRay/Readability_Ollama_LLM | This repo shows readability analysis of responses from... | | Experimental |
| 72 | ParthaPRay/python_rust_ollama_analysis | This repo shows how Ollama runs localized LLMs on a Raspberry Pi 4B... | | Experimental |
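Once the dataset JSON is downloaded, tier breakdowns like the one implicit in the table can be computed locally. A minimal sketch, assuming each record carries `name` and `tier` fields (the actual API schema is not documented here, so the sample records below are illustrative stand-ins):

```python
from collections import Counter

# Hypothetical records mirroring a few rows of the table above;
# the field names "name" and "tier" are assumptions about the payload.
records = [
    {"name": "thu-pacman/chitu", "tier": "Verified"},
    {"name": "NotPunchnox/rkllama", "tier": "Established"},
    {"name": "sophgo/LLM-TPU", "tier": "Established"},
    {"name": "eth-sri/lmql", "tier": "Emerging"},
]

def tier_breakdown(rows):
    """Count how many tools fall into each quality tier."""
    return Counter(row["tier"] for row in rows)

counts = tier_breakdown(records)
# counts: Verified=1, Established=2, Emerging=1 for this sample
```

Running the same function over the full 72-record payload reproduces the Verified / Established / Emerging / Experimental split shown in the Tier column.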