huggingface/text-generation-inference
Large Language Model Text Generation Inference
Built in Rust with Python bindings and gRPC support, TGI implements continuous batching, tensor parallelism across GPUs, and optimized kernels using Flash Attention and Paged Attention for popular model architectures. It provides OpenAI-compatible Chat Completion API endpoints alongside streaming generation via Server-Sent Events, with broad quantization support (bitsandbytes, GPTQ, AWQ, Marlin, fp8) and guidance features for constrained output formats. Now in maintenance mode, it pioneered the shift toward reusing transformers model implementations as a serving backend, an approach that downstream engines such as vLLM and SGLang have since adopted.
10,802 stars and 143,543 monthly downloads. Used by 3 other packages. Actively maintained with 1 commit in the last 30 days. Available on PyPI.
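The OpenAI-compatible Chat Completion endpoint mentioned above can be exercised with a standard JSON payload. A minimal sketch, assuming a TGI server running locally on its default port 8080; the model name is a placeholder, since a TGI instance serves a single model:

```python
import json

# Assumed local TGI server; adjust host/port for your deployment.
BASE_URL = "http://localhost:8080"
ENDPOINT = f"{BASE_URL}/v1/chat/completions"

def build_chat_request(prompt: str, stream: bool = False) -> dict:
    """Build an OpenAI-compatible chat payload.

    stream=True asks the server to respond with Server-Sent Events
    (token-by-token chunks) instead of a single JSON object.
    """
    return {
        "model": "tgi",  # placeholder: TGI serves one model per instance
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
        "stream": stream,
    }

payload = build_chat_request("What is continuous batching?")
print(json.dumps(payload, indent=2))
```

Because the payload matches the OpenAI schema, existing OpenAI client libraries can be pointed at a TGI instance by overriding the base URL.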
Stars
10,802
Forks
1,261
Language
Python
License
Apache-2.0
Category
Last pushed
Jan 08, 2026
Monthly downloads
143,543
Commits (30d)
1
Dependencies
3
Reverse dependents
3
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/huggingface/text-generation-inference"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Related models
poloclub/transformer-explainer
Transformer Explained Visually: Learn How LLM Transformer Models Work with Interactive Visualization
OpenMachine-ai/transformer-tricks
A collection of tricks and tools to speed up transformer models
IBM/TabFormer
Code & Data for "Tabular Transformers for Modeling Multivariate Time Series" (ICASSP, 2021)
tensorgi/TPA
[NeurIPS 2025 Spotlight] TPA: Tensor ProducT ATTenTion Transformer (T6)...
lorenzorovida/FHE-BERT-Tiny
Source code for the paper "Transformer-based Language Models and Homomorphic Encryption: an...