huggingface/text-generation-inference
Large Language Model Text Generation Inference
Built in Rust with Python bindings and gRPC support, TGI implements continuous batching, tensor parallelism across GPUs, and optimized kernels using Flash Attention and Paged Attention for popular model architectures. It provides OpenAI-compatible Chat Completion API endpoints alongside streaming generation via Server-Sent Events, with broad quantization support (bitsandbytes, GPTQ, AWQ, Marlin, fp8) and guidance features for constrained output formats. Now in maintenance mode, it pioneered the shift toward reusing transformers model implementations as a serving backend, an approach that downstream engines such as vLLM and SGLang have since adopted.
10,802 stars and 143,543 monthly downloads. Used by 3 other packages. Actively maintained with 1 commit in the last 30 days. Available on PyPI.
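The OpenAI-compatible Chat Completion endpoint mentioned above can be exercised with a standard JSON payload. A minimal sketch, assuming a TGI server running locally on its default port 8080; the model name is a placeholder, since a TGI instance serves a single model:

```python
import json

# Assumed local TGI server; adjust host/port for your deployment.
BASE_URL = "http://localhost:8080"
ENDPOINT = f"{BASE_URL}/v1/chat/completions"

def build_chat_request(prompt: str, stream: bool = False) -> dict:
    """Build an OpenAI-compatible chat payload.

    stream=True asks the server to respond with Server-Sent Events
    (token-by-token chunks) instead of a single JSON object.
    """
    return {
        "model": "tgi",  # placeholder: TGI serves one model per instance
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
        "stream": stream,
    }

payload = build_chat_request("What is continuous batching?")
print(json.dumps(payload, indent=2))
```

Because the payload matches the OpenAI schema, existing OpenAI client libraries can be pointed at a TGI instance by overriding the base URL.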
Stars
10,802
Forks
1,261
Language
Python
License
Apache-2.0
Category
Last pushed
Jan 08, 2026
Monthly downloads
143,543
Commits (30d)
1
Dependencies
3
Reverse dependents
3
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/huggingface/text-generation-inference"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Related models
poloclub/transformer-explainer
Transformer Explained Visually: Learn How LLM Transformer Models Work with Interactive Visualization
OpenMachine-ai/transformer-tricks
A collection of tricks and tools to speed up transformer models
IBM/TabFormer
Code & Data for "Tabular Transformers for Modeling Multivariate Time Series" (ICASSP, 2021)
tensorgi/TPA
[NeurIPS 2025 Spotlight] TPA: Tensor ProducT ATTenTion Transformer (T6)...
lorenzorovida/FHE-BERT-Tiny
Source code for the paper "Transformer-based Language Models and Homomorphic Encryption: an...