Tencent/TurboTransformers
A fast and user-friendly runtime for transformer inference (Bert, Albert, GPT2, decoders, etc.) on CPU and GPU.
Implements smart batching to eliminate zero-padding overhead for variable-length sequences, and provides both Python and C++ APIs that integrate with PyTorch models via direct conversion, with no offline tuning needed. Uses optimized kernels backed by BLAS providers (MKL/OpenBLAS) on CPU and Tensor Cores on GPU, achieving 1.88x–13.6x speedups in production WeChat services such as FAQ retrieval and recommendation ranking.
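The smart-batching idea can be sketched in plain Python: pack variable-length sequences end to end instead of padding each one to the batch maximum, so no compute is spent on padding tokens. This is an illustrative sketch only; the function names and sequence lengths are made up and are not TurboTransformers APIs.

```python
# Illustrative sketch (not TurboTransformers code): compare the number of
# tokens processed under conventional padded batching vs. smart batching.

def padded_tokens(lengths):
    # Conventional batching: every sequence is padded to the longest one,
    # so the model processes batch_size * max_length tokens.
    return len(lengths) * max(lengths)

def packed_tokens(lengths):
    # Smart batching: sequences are packed back to back, so only the
    # real tokens are processed.
    return sum(lengths)

# Hypothetical batch of four variable-length sequences.
lengths = [12, 37, 64, 8]
waste = padded_tokens(lengths) - packed_tokens(lengths)
print(padded_tokens(lengths), packed_tokens(lengths), waste)  # 256 121 135
```

With this batch, padding wastes more compute (135 token slots) than the real work requires (121 tokens), which is the overhead smart batching removes.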
1,542 stars. No commits in the last 6 months.
Stars: 1,542
Forks: 205
Language: C++
License: —
Category: —
Last pushed: Jul 18, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/Tencent/TurboTransformers"
Open to everyone: 100 requests/day with no key. A free key raises the limit to 1,000/day.
Related models
huggingface/transformers-bloom-inference
Fast Inference Solutions for BLOOM
mit-han-lab/lite-transformer
[ICLR 2020] Lite Transformer with Long-Short Range Attention
mit-han-lab/hardware-aware-transformers
[ACL'20] HAT: Hardware-Aware Transformers for Efficient Natural Language Processing
LibreTranslate/Locomotive
Toolkit for training/converting LibreTranslate compatible language models 🚂
aliemo/transfomers-silicon-research
Research and Materials on Hardware implementation of Transformer Model