Tencent/TurboTransformers
A fast and user-friendly runtime for transformer inference (Bert, Albert, GPT2, decoders, etc.) on CPU and GPU.
Implements smart batching to eliminate zero-padding overhead for variable-length sequences, and provides both Python and C++ APIs that integrate with PyTorch models via direct conversion, with no offline tuning needed. Uses optimized kernels backed by BLAS providers (MKL/OpenBLAS) on CPU and Tensor Cores on GPU, achieving 1.88x–13.6x speedups in production WeChat services such as FAQ retrieval and recommendation ranking.
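The smart-batching idea can be sketched in plain Python: pack variable-length sequences end to end instead of padding each one to the batch maximum, so no compute is spent on padding tokens. This is an illustrative sketch only; the function names and sequence lengths are made up and are not TurboTransformers APIs.

```python
# Illustrative sketch (not TurboTransformers code): compare the number of
# tokens processed under conventional padded batching vs. smart batching.

def padded_tokens(lengths):
    # Conventional batching: every sequence is padded to the longest one,
    # so the model processes batch_size * max_length tokens.
    return len(lengths) * max(lengths)

def packed_tokens(lengths):
    # Smart batching: sequences are packed back to back, so only the
    # real tokens are processed.
    return sum(lengths)

# Hypothetical batch of four variable-length sequences.
lengths = [12, 37, 64, 8]
waste = padded_tokens(lengths) - packed_tokens(lengths)
print(padded_tokens(lengths), packed_tokens(lengths), waste)  # 256 121 135
```

With this batch, padding wastes more compute (135 token slots) than the real work requires (121 tokens), which is the overhead smart batching removes.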
1,542 stars. No commits in the last 6 months.
Stars: 1,542
Forks: 205
Language: C++
License: —
Category: —
Last pushed: Jul 18, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/Tencent/TurboTransformers"
Open to everyone: 100 requests/day with no key. A free key raises the limit to 1,000/day.
Related models
huggingface/transformers-bloom-inference
Fast Inference Solutions for BLOOM
mit-han-lab/lite-transformer
[ICLR 2020] Lite Transformer with Long-Short Range Attention
mit-han-lab/hardware-aware-transformers
[ACL'20] HAT: Hardware-Aware Transformers for Efficient Natural Language Processing
LibreTranslate/Locomotive
Toolkit for training/converting LibreTranslate compatible language models 🚂
aliemo/transfomers-silicon-research
Research and Materials on Hardware implementation of Transformer Model