LLM Quantization Methods for Transformer Models
Tools and implementations for quantizing large language models using techniques like GPTQ, AWQ, and KV cache compression to reduce model size and inference costs. Does NOT include general model compression via pruning, distillation, or training optimization.
There are 71 LLM quantization projects tracked. 3 score above 70 (verified tier). The highest-rated is intel/auto-round at 88/100, with 883 stars and 44,854 monthly downloads. 5 of the top 10 are actively maintained.
Get the tracked projects as JSON (the query below uses limit=20; raise the limit to fetch all 71):
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=transformers&subcategory=llm-quantization-methods&limit=20"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
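The same query can be issued from Python using only the standard library. This is a minimal sketch: the endpoint and query parameters come from the curl example above, but the shape of the JSON response is not documented here, so the commented-out parsing step is an assumption.

```python
import json
import urllib.request
from urllib.parse import urlencode

# Build the dataset query shown in the curl example above.
base = "https://pt-edge.onrender.com/api/v1/datasets/quality"
params = {
    "domain": "transformers",
    "subcategory": "llm-quantization-methods",
    "limit": 71,  # raised from 20 to request the full list
}
url = f"{base}?{urlencode(params)}"
print(url)

# Uncomment to hit the live endpoint (100 requests/day without a key):
# with urllib.request.urlopen(url) as resp:
#     data = json.load(resp)  # response schema is an assumption
```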
| # | Model | Description | Tier |
|---|---|---|---|
| 1 | intel/auto-round | 🎯An accuracy-first, highly efficient quantization toolkit for LLMs, designed... | Verified |
| 2 | ModelCloud/GPTQModel | LLM model quantization (compression) toolkit with hw acceleration support... | Verified |
| 3 | pytorch/ao | PyTorch native quantization and sparsity for training and inference | Verified |
| 4 | Picovoice/picollm | On-device LLM Inference Powered by X-Bit Quantization | Established |
| 5 | NVIDIA/kvpress | LLM KV cache compression made easy | Established |
| 6 | BlinkDL/RWKV-LM | RWKV (pronounced RwaKuv) is an RNN with great LLM performance, which can... | Established |
| 7 | bodaay/HuggingFaceModelDownloader | Simple go utility to download HuggingFace Models and Datasets | Established |
| 8 | ddh0/easy-llama | Python package wrapping llama.cpp for on-device LLM inference | Established |
| 9 | jy-yuan/KIVI | [ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache | Emerging |
| 10 | livingbio/fuzzy-json | Fuzzy-JSON is a compact Python package with no dependencies, designed to... | Emerging |
| 11 | back2matching/turboquant | First open-source TurboQuant KV cache compression for LLM inference. Drop-in... | Emerging |
| 12 | AutoGPTQ/AutoGPTQ | An easy-to-use LLMs quantization package with user-friendly apis, based on... | Emerging |
| 13 | laelhalawani/gguf_modeldb | A quick and optimized solution to manage llama based gguf quantized models,... | Emerging |
| 14 | calcuis/gguf-core | a simple way to interact llama with gguf | Emerging |
| 15 | TencentARC/LLaMA-Pro | [ACL 2024] Progressive LLaMA with Block Expansion. | Emerging |
| 16 | zjysteven/mink-plus-plus | [ICLR'25 Spotlight] Min-K%++: Improved baseline for detecting pre-training... | Emerging |
| 17 | SqueezeAILab/SqueezeLLM | [ICML 2024] SqueezeLLM: Dense-and-Sparse Quantization | Emerging |
| 18 | zackshen/gguf | a GGUF file parser | Emerging |
| 19 | GAIR-NLP/ProX | [ICML 2025] Programming Every Example: Lifting Pre-training Data Quality... | Emerging |
| 20 | Michael-A-Kuykendall/shimmytok | Pure Rust tokenizer for GGUF models - llama.cpp compatible | Emerging |
| 21 | ariannamethod/doe | DoE Janus Architecture: Democracy of Experts | Emerging |
| 22 | SqueezeAILab/LLM2LLM | [ACL 2024] LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement | Emerging |
| 23 | NVlabs/RocketKV | [ICML 2025] RocketKV: Accelerating Long-Context LLM Inference via Two-Stage... | Emerging |
| 24 | AaronFeng753/Ollama-Model-Dumper | Export and Backup Ollama models into GGUF and ModelFile | Emerging |
| 25 | awneesht/KVShuttle | Benchmark & decision framework for KV cache transfer compression in... | Emerging |
| 26 | gitctrlx/llama.cu | Llama from scratch in CUDA with Flash Attention. | Emerging |
| 27 | StargazerX0/ScaleKV | [NeurIPS 2025] ScaleKV: Memory-Efficient Visual Autoregressive Modeling with... | Emerging |
| 28 | ModelTC/QLLM | [ICLR 2024] This is the official PyTorch implementation of "QLLM: Accurate... | Emerging |
| 29 | Beomi/BitNet-Transformers | 0️⃣1️⃣🤗 BitNet-Transformers: Huggingface Transformers Implementation of... | Emerging |
| 30 | SqueezeAILab/KVQuant | [NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with... | Emerging |
| 31 | monk1337/auto-ollama | run ollama & gguf easily with a single command | Emerging |
| 32 | laelhalawani/gguf_llama | Wrapper for simplified use of Llama2 GGUF quantized models. | Emerging |
| 33 | smpanaro/coreml-llm-cli | CLI to demonstrate running a large language model (LLM) on Apple Neural Engine. | Emerging |
| 34 | Rishit-dagli/GLU | An easy-to-use library for GLU (Gated Linear Units) and GLU variants in TensorFlow. | Emerging |
| 35 | gpustack/gguf-packer-go | Deliver LLMs of GGUF format via Dockerfile. | Emerging |
| 36 | LMLK-seal/HuggingGGUF | Hugging Face Model downloader and GGUF Converter. | Emerging |
| 37 | camenduru/alpaca-lora-colab | Alpaca Lora | Experimental |
| 38 | Zishan-Shao/FlashSVD | Welcome to the FlashSVD, an activation aware inference system for SVD-based... | Experimental |
| 39 | leliuga/cohere-configurations | Co:Here Inference configurations | Experimental |
| 40 | elephantmipt/compressors | A small library with distillation, quantization and pruning pipelines | Experimental |
| 41 | laelhalawani/glai | glai - GGUF LLAMA AI - Package for simplified model handling and text... | Experimental |
| 42 | eliahuhorwitz/MoTHer | Official PyTorch Implementation for the "Unsupervised Model Tree Heritage... | Experimental |
| 43 | codewithdark-git/QuantLLM | QuantLLM is a Python library designed for developers, researchers, and teams... | Experimental |
| 44 | lpalbou/model-quantizer | Effortlessly quantize, benchmark, and publish Hugging Face models with... | Experimental |
| 45 | calcuis/llama-core | solo connector core built on llama.cpp | Experimental |
| 46 | kyegomez/open_qwen | A non-official implementation of Qwen 3.5, as there doesn’t seem to be a... | Experimental |
| 47 | Evrmind-UK/evr-llama | Runtime binaries for Evrmind EVR-1 models | Experimental |
| 48 | petermartens98/Qwen3-LLM-Pytorch-Implementation-From-Scratch | Lightweight LLM inspired by Qwen3, built from scratch in PyTorch. Full... | Experimental |
| 49 | boyazzam/kvcache-autotune | 🚀 Optimize your KVCache performance with automatic tuning for efficient... | Experimental |
| 50 | calcuis/gguf-selector | GGUF selector | Experimental |
| 51 | calcuis/callgg | GGUF caller | Experimental |
| 52 | pecharesjoselito/chuck.optimizer | Optimize neural network training by monitoring loss, gradients, and... | Experimental |
| 53 | arcxteam/gguf-convert-model | Auto GGUF Converter for HuggingFace Hub Models with Multiple Quantizations... | Experimental |
| 54 | Keyvanhardani/kvcache-autotune | Automatic KV-Cache optimization for HuggingFace Transformers. Find the... | Experimental |
| 55 | pszemraj/decoder-pytorch-template | Hackable PyTorch template for decoder-only transformer architecture... | Experimental |
| 56 | SolomonB14D3/intelligent-svd | Knowledge-preserving SVD compression for large language models via... | Experimental |
| 57 | Kalmantic/peakweights | Data-free discovery of critical LLM weights. One forward pass. No... | Experimental |
| 58 | bkataru/hf-hub-zig | Zig library and CLI for interacting with the HuggingFace Hub API, with a... | Experimental |
| 59 | Zoclee/xojo-llama | A wrapper module to do local LLM inference on GGUF models using the... | Experimental |
| 60 | jaepil/geometric-adam | A Ray Tracing-Inspired Approach to Neural Network Optimization | Experimental |
| 61 | ambv231/tinyllama-coreml-ios18-quantization | Quantize TinyLlama-1.1B-Chat from PyTorch to CoreML (float16, int8, int4)... | Experimental |
| 62 | LiteObject/llm-quantization-playground | A hands-on demo project that compares multiple quantization methods for... | Experimental |
| 63 | zzbright1998/SentenceKV | Official implementation of "SentenceKV: Efficient LLM Inference via... | Experimental |
| 64 | lciric/gptq-from-scratch | GPTQ post-training quantization from scratch — GPT-2, OPT, LLaMA support | Experimental |
| 65 | megvii-research/IntLLaMA | IntLLaMA: A fast and light quantization solution for LLaMA | Experimental |
| 66 | 1337hero/rx7900xtx-llama-bench-vulcan | Benchmark script for llama.cpp & results for AMD RX 7900 XTX - using Vulcan | Experimental |
| 67 | GodreignElgin/llm-comparision | Jupyter Notebook for LLM compression via quantization (INT8, INT4, FP16) and... | Experimental |
| 68 | MohammadKaso/tiny_Llama_mcp_flutter | edge_flutter enables seamless on-device Large Language Model inference using... | Experimental |
| 69 | j341nono/LLMGusser | CLI guessing game to identify which LLM (Llama vs Gemma) generated text,... | Experimental |
| 70 | LMLK-seal/ModelQuants | Professional Model Quantization Converter for HuggingFace Transformers | Experimental |
| 71 | trifledmatter/model-engine | C++ Implementation of Meta's LLaMA v2 Engine. Credited to ggerganov/llama.cpp | Experimental |
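For readers new to the area, the operation underlying the weight-quantization toolkits above is per-group round-to-nearest quantization: each small group of float weights is replaced by low-bit integer codes plus one shared scale. The sketch below is a deliberately simplified pure-Python illustration of that idea, not any specific library's algorithm.

```python
def quantize_group(weights, bits=4):
    """Symmetric round-to-nearest quantization of one weight group.

    Returns (codes, scale) such that codes[i] * scale approximates weights[i].
    """
    qmax = 2 ** (bits - 1) - 1  # 7 for signed int4
    # One shared scale per group, chosen so the largest weight maps to qmax.
    scale = max(abs(w) for w in weights) / qmax or 1.0
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


group = [0.12, -0.40, 0.33, 0.05]
codes, scale = quantize_group(group)
dequant = [c * scale for c in codes]
# Round-trip error per weight is bounded by scale / 2.
```

Real toolkits refine this basic recipe rather than replace it: GPTQ updates the not-yet-quantized weights after each rounding step to compensate for the error, and AWQ rescales channels using activation statistics before rounding, but both still store low-bit codes plus per-group scales as sketched above.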