thu-pacman/chitu
High-performance inference framework for large language models, focusing on efficiency, flexibility, and availability.
Supports heterogeneous CPU+GPU inference and on-the-fly quantization conversion (FP4→FP8/BF16), enabling single-card deployment of 671B models. Provides production-grade stability across diverse hardware platforms—NVIDIA, Ascend, Moxin, and Hygon—with scalability from pure CPU setups to large distributed clusters. Optimized for models like DeepSeek, Qwen, and GLM with integrated batching and long-context efficiency techniques.
3,418 stars and 13 monthly downloads. Actively maintained with 128 commits in the last 30 days. Available on PyPI.
Stars: 3,418
Forks: 477
Language: Python
License: Apache-2.0
Category:
Last pushed: Mar 13, 2026
Monthly downloads: 13
Commits (30d): 128
Dependencies: 21
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/thu-pacman/chitu"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
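A minimal Python sketch of calling the same endpoint. The URL structure is taken from the curl example above; the assumption that the response body is JSON, and the helper names themselves, are illustrative rather than documented.

```python
import json
import urllib.request

BASE = "https://pt-edge.onrender.com/api/v1/quality"

def quality_url(category: str, owner: str, repo: str) -> str:
    """Build the quality-stats endpoint URL for a repository."""
    return f"{BASE}/{category}/{owner}/{repo}"

def fetch_quality(category: str, owner: str, repo: str) -> dict:
    """Fetch repository stats (100 requests/day without a key).

    Assumes the endpoint returns a JSON object, which is not
    confirmed by the page text.
    """
    with urllib.request.urlopen(quality_url(category, owner, repo)) as resp:
        return json.load(resp)

url = quality_url("llm-tools", "thu-pacman", "chitu")
# → "https://pt-edge.onrender.com/api/v1/quality/llm-tools/thu-pacman/chitu"
```

With a free key, the higher 1,000/day limit presumably requires sending the key with each request; the exact header or query-parameter name is not given on this page.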
Related tools
NotPunchnox/rkllama
Ollama alternative for Rockchip NPU: An efficient solution for running AI and Deep learning...
sophgo/LLM-TPU
Run generative AI models in sophgo BM1684X/BM1688
Deep-Spark/DeepSparkHub
DeepSparkHub selects hundreds of application algorithms and models, covering various fields of...
HuaizhengZhang/AI-Infra-from-Zero-to-Hero
🚀 Awesome System for Machine Learning ⚡️ AI System Papers and Industry Practice. ⚡️ System for...
eth-sri/lmql
A language for constraint-guided and efficient LLM programming.