GPTQModel and AutoGPTQ
GPTQModel is a maintained fork and extension of AutoGPTQ that adds hardware acceleration support across multiple backends (CUDA, ROCm, XPU, CPU), while AutoGPTQ is the original reference implementation of the GPTQ quantization algorithm.
About GPTQModel
ModelCloud/GPTQModel
LLM quantization (compression) toolkit with hardware acceleration support for Nvidia CUDA, AMD ROCm, Intel XPU, and Intel/AMD/Apple CPU, usable via HF Transformers, vLLM, and SGLang.
About AutoGPTQ
AutoGPTQ/AutoGPTQ
An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
**Technical Summary:** Supports weight-only quantization to 4-bit or 3-bit precision with configurable group sizes and activation ordering, and integrates with Hugging Face Transformers, Optimum, and PEFT for seamless model loading and fine-tuning. Provides optimized CUDA kernels (including Marlin int4×fp16 matrix multiplication) and Triton backends for accelerated inference, reaching inference speeds of 25-91 tokens/sec on A100 GPUs while maintaining model accuracy. Targets NVIDIA, AMD ROCm, and Intel Gaudi 2 platforms, with pre-built wheels for CUDA 11.8/12.1.
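To make the "configurable group sizes" knob concrete, here is a minimal sketch of weight-only group quantization, the core idea behind GPTQ-style 4-bit formats. This is illustrative only: real GPTQ additionally uses Hessian-based error compensation and activation ordering, which are omitted here, and the tiny group size of 4 is chosen for readability (production configs commonly use 128).

```python
def quantize_group(weights, bits=4):
    """Symmetric round-to-nearest quantization of one group of weights.

    Returns (integer codes, scale); dequantize each weight as code * scale.
    """
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for signed 4-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def quantize_row(row, bits=4, group_size=4):
    """Split a weight row into groups; each group gets its own scale.

    The per-group scale is what the 'group size' setting controls:
    smaller groups -> more scales stored -> better accuracy, more overhead.
    """
    groups = []
    for i in range(0, len(row), group_size):
        groups.append(quantize_group(row[i:i + group_size], bits))
    return groups

def dequantize_row(groups):
    """Reconstruct approximate float weights from (codes, scale) pairs."""
    return [c * s for codes, s in groups for c in codes]

# Round-trip a small weight row through 4-bit group quantization.
row = [0.12, -0.53, 0.98, -0.27, 0.05, 0.44, -0.91, 0.33]
groups = quantize_row(row, bits=4, group_size=4)
restored = dequantize_row(groups)
```

Each restored weight differs from the original by at most half a quantization step (scale / 2), which is why per-group scales matter: one outlier weight only degrades precision within its own group rather than across the whole row.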