pytorch/ao
PyTorch native quantization and sparsity for training and inference
Provides composable quantization techniques (int4/int8 weight-only, float8 dynamic, QAT) and structured sparsity methods (2:4 semi-structured, block sparsity) with optimized kernels via MSLK, enabling training speedups up to 1.5x and inference gains up to 2.37x. Integrates seamlessly with `torch.compile()`, FSDP2, and popular fine-tuning frameworks (Unsloth, Axolotl, HF Transformers), plus inference backends like vLLM and ExecuTorch for edge deployment.
2,729 stars. Actively maintained with 132 commits in the last 30 days.
Stars: 2,729
Forks: 456
Language: Python
License: —
Category: —
Last pushed: Mar 13, 2026
Commits (30d): 132
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/pytorch/ao"
Open to everyone: 100 requests/day with no key required; a free key raises the limit to 1,000/day.
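The same endpoint can be called from Python. A stdlib-only sketch, with the network call left commented out so it can be run offline (the response format is not documented here, so no fields are assumed):

```python
# Sketch: querying the stats endpoint shown above with urllib (stdlib only).
import json
import urllib.request

url = "https://pt-edge.onrender.com/api/v1/quality/transformers/pytorch/ao"
req = urllib.request.Request(url, headers={"Accept": "application/json"})

# Uncomment to perform the actual request (counts against the daily limit):
# with urllib.request.urlopen(req) as resp:
#     data = json.load(resp)
#     print(data)

print(req.full_url)
```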
Related models
intel/auto-round
🎯 An accuracy-first, highly efficient quantization toolkit for LLMs, designed to minimize quality...
ModelCloud/GPTQModel
LLM model quantization (compression) toolkit with hw acceleration support for Nvidia CUDA, AMD...
BlinkDL/RWKV-LM
RWKV (pronounced RwaKuv) is an RNN with great LLM performance, which can also be directly...
Picovoice/picollm
On-device LLM Inference Powered by X-Bit Quantization
NVIDIA/kvpress
LLM KV cache compression made easy