LLM Inference Serving Tools

Tools and frameworks for deploying, serving, and scaling LLM inference endpoints in production environments. Includes optimization techniques (quantization, batching, caching), serving platforms (vLLM, Ray Serve, BentoML), and infrastructure solutions. Does NOT include client SDKs, application frameworks, or fine-tuning tools.

72 LLM inference serving tools are tracked. One scores above 70 (Verified tier). The highest-rated is thu-pacman/chitu at 85/100, with 3,418 stars and 13 monthly downloads. One of the top 10 is actively maintained.

Get all 72 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=llm-inference-serving&limit=20"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
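The same request can be made from a script. The following is a minimal sketch using only the Python standard library; it reuses the URL and query parameters from the curl command above. The helper names `build_url` and `fetch_projects` are hypothetical, and because the response schema is not documented on this page, the sketch only decodes the body as JSON without assuming any field names. The curl example passes `limit=20`; whether a larger value returns all 72 projects in one call is not confirmed here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl example on this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="llm-tools", subcategory="llm-inference-serving", limit=20):
    """Build the query URL with the same parameters the curl example uses."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_projects(url, timeout=10):
    """GET the URL and decode the body as JSON.

    The shape of the returned object is an unknown here, so the caller
    should inspect it before relying on any particular key.
    """
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (performs a network request, counted against the daily limit):
#   data = fetch_projects(build_url())
#   print(type(data).__name__)
```

Keeping URL construction separate from the network call makes it easy to check the query string without spending any of the daily request quota.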

| # | Tool | Description | Score | Tier |
|---|------|-------------|-------|------|
| 1 | thu-pacman/chitu | High-performance inference framework for large language models, focusing on... | 85 | Verified |
| 2 | NotPunchnox/rkllama | Ollama alternative for Rockchip NPU: An efficient solution for running AI... | 53 | Established |
| 3 | sophgo/LLM-TPU | Run generative AI models in sophgo BM1684X/BM1688 | 53 | Established |
| 4 | Deep-Spark/DeepSparkHub | DeepSparkHub selects hundreds of application algorithms and models, covering... | 49 | Emerging |
| 5 | HuaizhengZhang/AI-Infra-from-Zero-to-Hero | 🚀 Awesome System for Machine Learning ⚡️ AI System Papers and Industry... | 48 | Emerging |
| 6 | eth-sri/lmql | A language for constraint-guided and efficient LLM programming. | 46 | Emerging |
| 7 | bentoml/llm-inference-handbook | Everything you need to know about LLM inference | 45 | Emerging |
| 8 | tomdyson/microllama | The smallest possible LLM API | 45 | Emerging |
| 9 | howard-hou/VisualRWKV | VisualRWKV is the visual-enhanced version of the RWKV language model,... | 42 | Emerging |
| 10 | ucbepic/BARGAIN | Low-Cost LLM-Powered Data Processing with Theoretical Guarantees | 41 | Emerging |
| 11 | liguodongiot/llm-resource | A curated collection of high-quality full-stack LLM resources | 40 | Emerging |
| 12 | 0-mostafa-rezaee-0/Batch_LLM_Inference_with_Ray_Data_LLM | Batch LLM Inference with Ray Data LLM: From Simple to Advanced | 39 | Emerging |
| 13 | vicharak-in/Axon-NPU-Guide | This repository contains guide on how to setup toolkits to use NPU present... | 37 | Emerging |
| 14 | aws-samples/easy-model-deployer | Deploy open-source LLMs on AWS in minutes, with OpenAI-compatible APIs and... | 37 | Emerging |
| 15 | FareedKhan-dev/llm-scale-deploy-guide | An end-to-end pipeline to optimize and host LLM for 100K parallel queries | 37 | Emerging |
| 16 | Seeed-Projects/reComputer-RK-LLM | This repository utilizes Docker to package large language models and... | 37 | Emerging |
| 17 | CHKDSKLabs/l-bom | L-BOM is a small Python CLI that inspects local LLM model artifacts such as... | 37 | Emerging |
| 18 | manuelescobar-dev/LLM-Tools | Open-source calculator for LLM system requirements. | 37 | Emerging |
| 19 | alibaba/ServeGen | A framework for generating realistic LLM serving workloads | 36 | Emerging |
| 20 | kungfuai/CVlization | Practical workflows for training and inference on AI models | 35 | Emerging |
| 21 | Pelochus/ezrknpu | Easy installation and usage of Rockchip's NPUs found in RK3588 and similar SoCs | 34 | Emerging |
| 22 | jmaczan/torch-webgpu | PyTorch compiler and WebGPU runtime | 34 | Emerging |
| 23 | wangcx18/llm-vscode-inference-server | An endpoint server for efficiently serving quantized open-source LLMs for code. | 33 | Emerging |
| 24 | av1d/rk3588_npu_llm_server | Allows access via HTTP to LLM running on RK3588 NPU. Returns JSON response. | 32 | Emerging |
| 25 | av1d/NPU-Chat | Web chat front end for rk3588_npu_llm_server / RK3588 LLM chat interface | 31 | Emerging |
| 26 | AlexKaravaev/world-creator | LLM-based CLI utility for simulation worlds creation. | 31 | Emerging |
| 27 | tpietruszka/rate_limited | Efficient parallel utilization of slow, rate-limited APIs - like those of... | 30 | Emerging |
| 28 | thekevinscott/vicuna-7b | Vicuna 7B is a large language model that runs in the browser. Exposes... | 30 | Emerging |
| 29 | aws-samples/amazon-sagemaker-llama2-response-streaming-recipes | Amazon SageMaker Llama 2 Inference via Response Streaming | 29 | Experimental |
| 30 | serialscriptr/Orange-PI-5-Pro-MLC-LLM | Guide I wrote mostly for myself on how to run mlc-llm on the Orange Pi 5 Pro | 28 | Experimental |
| 31 | Zerohertz/PyCon_KR_2025_Tutorial_vLLM | 🐍 PyCon Korea 2025 Tutorial: A Close Look at vLLM's OpenAI-Compatible Server 🐍 | 28 | Experimental |
| 32 | SRSWTI/axis | AI eXplainable Inference & Search. Open Sourcing on-premise, ultra-fast... | 28 | Experimental |
| 33 | wudingjian/rkllm_chat | Deploy LLM models to the Rockchip RK3588 chip and run inference on the development board using the NPU | 28 | Experimental |
| 34 | plushpluto/kllm | Welcome to KLLM, an advanced project focused on core kernel AI development,... | 27 | Experimental |
| 35 | tmcarmichael/fabricai-inference-server | A hackable, modular, containerized inference server for deploying large... | 26 | Experimental |
| 36 | llmcloud24/de.KCD-Summer-School-2024 | Learn how to deploy your own LLM in the de.NBI cloud via a step-by-step... | 26 | Experimental |
| 37 | Leon6225/InternVL3.5-4B-NPU | 🌌 Advance multimodal AI with InternVL3.5-4B for RK3588 NPU, enhancing vision... | 23 | Experimental |
| 38 | parawaveio/parawave | One decorator turns any function into a durable parallel runner. | 23 | Experimental |
| 39 | selimsandal/OneShotNPU | An NPU designed using an LLM with a single prompt | 22 | Experimental |
| 40 | cdepillabout/mkAIDerivation | Generate a Nix derivation on the fly using an LLM | 22 | Experimental |
| 41 | zia1138/rayevolve | Experimental project for LLM guided algorithm design and optimization built on ray | 22 | Experimental |
| 42 | toopac01/InternVL3.5-8B-NPU | 🌌 Explore InternVL3.5-8B NPU for advanced multimodal capabilities on RK3588,... | 22 | Experimental |
| 43 | Joao1PNM/awesome-llm-training-inference | Explore frameworks, tools, and resources for efficient large language model... | 22 | Experimental |
| 44 | christophe0606/MLHelium | TinyLlama on Cortex-M55 using CMSIS-DSP and Helium vector instructions | 21 | Experimental |
| 45 | Notnaton/microllm | My own implementation to run inference on local LLM models | 21 | Experimental |
| 46 | godaai/llm-inference | Resources for Large Language Model Inference | 20 | Experimental |
| 47 | yy29/aws-ec2-tips-llm-chat-ai | Tips for setting up AI & Machine Learning R&D Environment and LLM Training &... | 19 | Experimental |
| 48 | ravijo/pi-llm | Run large language models locally on a Raspberry Pi Zero 2W (512 MB RAM)... | 19 | Experimental |
| 49 | Zerohertz/Instruct_KR_2025_Summer_Meetup_vLLM | 🎹 Instruct.KR 2025 Summer Meetup: Open-Source LLMs, All the Way to Production with vLLM 🎹 | 19 | Experimental |
| 50 | imetallica/nano-ai | Toolkit to train and build Small LLMs in Elixir | 18 | Experimental |
| 51 | sajidkhan2067/LLMOnAWS | Deploy smaller LLM on AWS Lambda: Phi-2, cost-effective language model | 18 | Experimental |
| 52 | CuzImSlymi/Apertis-LLM | Apertis LLM. Clean. Fast. Built Different. Custom LLM architecture designed... | 17 | Experimental |
| 53 | jaslatendresse/llm-demo | This repository demonstrates how to do inference using llama.cpp on a... | 17 | Experimental |
| 54 | ArslanKAS/Serverless-LLM-Amazon-Bedrock | You’ll learn how to deploy a large language model-based application into... | 17 | Experimental |
| 55 | romitjain/awesome-llm-systems | This repository aims to consolidate resources for learning about systems for LLM | 16 | Experimental |
| 56 | daslearning-org/OnLLM | OnLLM is the platform to run LLM or SLM models using OnnxRuntime directly on... | 16 | Experimental |
| 57 | gfhe/LLM | Exploring private LLM training and deployment | 15 | Experimental |
| 58 | aratan/LLM-CLI | LLM aratan/qwen3.5-uncensored:9b | 15 | Experimental |
| 59 | oriolrius/sagemaker-llm-endpoint | Deploy HuggingFace LLMs on AWS SageMaker with vLLM, OpenAI-compatible API... | 15 | Experimental |
| 60 | playaswd/rwkv-explainer | RWKV Explained Visually: Learn How LLM RWKV Models Work with Interactive... | 14 | Experimental |
| 61 | ray-project/anyscale-berkeley-ai-hackathon | Ray and Anyscale for UC Berkeley AI Hackathon! | 14 | Experimental |
| 62 | ray-project/ray-serve-arize-observe | Building Real-Time Inference Pipelines with Ray Serve | 14 | Experimental |
| 63 | CosmonautCode/Tiny-Local-LLM-System | A lightweight, self-contained Python project for running a local large... | 14 | Experimental |
| 64 | mddunlap924/LLM-Inference-Serving | This repository demonstrates LLM execution on CPUs using packages like... | 14 | Experimental |
| 65 | Qually5/distributed-training-ops | A collection of scripts and configurations for managing distributed training... | 14 | Experimental |
| 66 | Rustem/ddl-playbook | Distributed Deep Learning Playbook | 14 | Experimental |
| 67 | yutingshih/eai2024-final | Enhancing User Privacy by Local Deployment of LLMs, Final Project of EAI 2024 Fall | 14 | Experimental |
| 68 | gbaptista/nano-apps | Tiny applications that can be embedded in Nano Bots: small, AI-powered robots... | 13 | Experimental |
| 69 | look4pritam/InferenceServer-LargeLanguageModels | Large Language Models Inference Server | 11 | Experimental |
| 70 | cjmcv/ai-infra-notes | Reading notes on the open source code of AI infrastructure (sglang, llm,... | 11 | Experimental |
| 71 | ParthaPRay/Readability_Ollama_LLM | This repo shows the coding of readability analysis of response from... | 11 | Experimental |
| 72 | ParthaPRay/python_rust_ollama_analysis | This repo shows the coding of how Ollama localized LLMs on raspberry pi 4b... | 11 | Experimental |
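The relationship between score and tier in the listing can be sketched as a small lookup. Only the Verified boundary (above 70) is stated on this page. The other cutoffs used below (50 and 30) are assumptions inferred from the listed rows, where 53 is Established, 49 and 30 are Emerging, and 29 is Experimental; the site's actual thresholds may differ. The function name `tier_for` is hypothetical.

```python
def tier_for(score):
    """Map a 0-100 quality score to a tier label.

    Only the "above 70" rule for Verified comes from the page itself.
    The 50 and 30 cutoffs are ASSUMED from the rows in the listing and
    are merely consistent with them, not confirmed.
    """
    if score > 70:
        return "Verified"
    if score >= 50:  # assumed cutoff
        return "Established"
    if score >= 30:  # assumed cutoff
        return "Emerging"
    return "Experimental"


# A few (tool, score) pairs copied from the listing.
sample = [
    ("thu-pacman/chitu", 85),
    ("sophgo/LLM-TPU", 53),
    ("Deep-Spark/DeepSparkHub", 49),
    ("thekevinscott/vicuna-7b", 30),
    ("SRSWTI/axis", 28),
]

for tool, score in sample:
    print(f"{tool}: {score} -> {tier_for(score)}")
```

Under these assumed cutoffs, the function reproduces the tier shown for every row in the listing.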