LLM Quantization Methods for Transformer Models
Tools and implementations for quantizing large language models using techniques like GPTQ, AWQ, and KV cache compression to reduce model size and inference costs. Does NOT include general model compression via pruning, distillation, or training optimization.
There are 71 LLM quantization projects tracked. 3 score above 70 (verified tier). The highest-rated is intel/auto-round at 88/100, with 883 stars and 44,854 monthly downloads. 5 of the top 10 are actively maintained.
Get the tracked projects as JSON (the query below uses limit=20; raise the limit to fetch all 71):
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=transformers&subcategory=llm-quantization-methods&limit=20"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
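The same query can be issued from Python using only the standard library. This is a minimal sketch: the endpoint and query parameters come from the curl example above, but the shape of the JSON response is not documented here, so the commented-out parsing step is an assumption.

```python
import json
import urllib.request
from urllib.parse import urlencode

# Build the dataset query shown in the curl example above.
base = "https://pt-edge.onrender.com/api/v1/datasets/quality"
params = {
    "domain": "transformers",
    "subcategory": "llm-quantization-methods",
    "limit": 71,  # raised from 20 to request the full list
}
url = f"{base}?{urlencode(params)}"
print(url)

# Uncomment to hit the live endpoint (100 requests/day without a key):
# with urllib.request.urlopen(url) as resp:
#     data = json.load(resp)  # response schema is an assumption
```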
| # | Model | Description | Tier |
|---|---|---|---|
| 1 | intel/auto-round | 🎯An accuracy-first, highly efficient quantization toolkit for LLMs, designed... | Verified |
| 2 | ModelCloud/GPTQModel | LLM model quantization (compression) toolkit with hw acceleration support... | Verified |
| 3 | pytorch/ao | PyTorch native quantization and sparsity for training and inference | Verified |
| 4 | Picovoice/picollm | On-device LLM Inference Powered by X-Bit Quantization | Established |
| 5 | NVIDIA/kvpress | LLM KV cache compression made easy | Established |
| 6 | BlinkDL/RWKV-LM | RWKV (pronounced RwaKuv) is an RNN with great LLM performance, which can... | Established |
| 7 | bodaay/HuggingFaceModelDownloader | Simple go utility to download HuggingFace Models and Datasets | Established |
| 8 | ddh0/easy-llama | Python package wrapping llama.cpp for on-device LLM inference | Established |
| 9 | jy-yuan/KIVI | [ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache | Emerging |
| 10 | livingbio/fuzzy-json | Fuzzy-JSON is a compact Python package with no dependencies, designed to... | Emerging |
| 11 | back2matching/turboquant | First open-source TurboQuant KV cache compression for LLM inference. Drop-in... | Emerging |
| 12 | AutoGPTQ/AutoGPTQ | An easy-to-use LLMs quantization package with user-friendly apis, based on... | Emerging |
| 13 | laelhalawani/gguf_modeldb | A quick and optimized solution to manage llama based gguf quantized models,... | Emerging |
| 14 | calcuis/gguf-core | a simple way to interact llama with gguf | Emerging |
| 15 | TencentARC/LLaMA-Pro | [ACL 2024] Progressive LLaMA with Block Expansion. | Emerging |
| 16 | zjysteven/mink-plus-plus | [ICLR'25 Spotlight] Min-K%++: Improved baseline for detecting pre-training... | Emerging |
| 17 | SqueezeAILab/SqueezeLLM | [ICML 2024] SqueezeLLM: Dense-and-Sparse Quantization | Emerging |
| 18 | zackshen/gguf | a GGUF file parser | Emerging |
| 19 | GAIR-NLP/ProX | [ICML 2025] Programming Every Example: Lifting Pre-training Data Quality... | Emerging |
| 20 | Michael-A-Kuykendall/shimmytok | Pure Rust tokenizer for GGUF models - llama.cpp compatible | Emerging |
| 21 | ariannamethod/doe | DoE Janus Architecture: Democracy of Experts | Emerging |
| 22 | SqueezeAILab/LLM2LLM | [ACL 2024] LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement | Emerging |
| 23 | NVlabs/RocketKV | [ICML 2025] RocketKV: Accelerating Long-Context LLM Inference via Two-Stage... | Emerging |
| 24 | AaronFeng753/Ollama-Model-Dumper | Export and Backup Ollama models into GGUF and ModelFile | Emerging |
| 25 | awneesht/KVShuttle | Benchmark & decision framework for KV cache transfer compression in... | Emerging |
| 26 | gitctrlx/llama.cu | Llama from scratch in CUDA with Flash Attention. | Emerging |
| 27 | StargazerX0/ScaleKV | [NeurIPS 2025] ScaleKV: Memory-Efficient Visual Autoregressive Modeling with... | Emerging |
| 28 | ModelTC/QLLM | [ICLR 2024] This is the official PyTorch implementation of "QLLM: Accurate... | Emerging |
| 29 | Beomi/BitNet-Transformers | 0️⃣1️⃣🤗 BitNet-Transformers: Huggingface Transformers Implementation of... | Emerging |
| 30 | SqueezeAILab/KVQuant | [NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with... | Emerging |
| 31 | monk1337/auto-ollama | run ollama & gguf easily with a single command | Emerging |
| 32 | laelhalawani/gguf_llama | Wrapper for simplified use of Llama2 GGUF quantized models. | Emerging |
| 33 | smpanaro/coreml-llm-cli | CLI to demonstrate running a large language model (LLM) on Apple Neural Engine. | Emerging |
| 34 | Rishit-dagli/GLU | An easy-to-use library for GLU (Gated Linear Units) and GLU variants in TensorFlow. | Emerging |
| 35 | gpustack/gguf-packer-go | Deliver LLMs of GGUF format via Dockerfile. | Emerging |
| 36 | LMLK-seal/HuggingGGUF | Hugging Face Model downloader and GGUF Converter. | Emerging |
| 37 | camenduru/alpaca-lora-colab | Alpaca Lora | Experimental |
| 38 | Zishan-Shao/FlashSVD | Welcome to the FlashSVD, an activation aware inference system for SVD-based... | Experimental |
| 39 | leliuga/cohere-configurations | Co:Here Inference configurations | Experimental |
| 40 | elephantmipt/compressors | A small library with distillation, quantization and pruning pipelines | Experimental |
| 41 | laelhalawani/glai | glai - GGUF LLAMA AI - Package for simplified model handling and text... | Experimental |
| 42 | eliahuhorwitz/MoTHer | Official PyTorch Implementation for the "Unsupervised Model Tree Heritage... | Experimental |
| 43 | codewithdark-git/QuantLLM | QuantLLM is a Python library designed for developers, researchers, and teams... | Experimental |
| 44 | lpalbou/model-quantizer | Effortlessly quantize, benchmark, and publish Hugging Face models with... | Experimental |
| 45 | calcuis/llama-core | solo connector core built on llama.cpp | Experimental |
| 46 | kyegomez/open_qwen | A non-official implementation of Qwen 3.5, as there doesn’t seem to be a... | Experimental |
| 47 | Evrmind-UK/evr-llama | Runtime binaries for Evrmind EVR-1 models | Experimental |
| 48 | petermartens98/Qwen3-LLM-Pytorch-Implementation-From-Scratch | Lightweight LLM inspired by Qwen3, built from scratch in PyTorch. Full... | Experimental |
| 49 | boyazzam/kvcache-autotune | 🚀 Optimize your KVCache performance with automatic tuning for efficient... | Experimental |
| 50 | calcuis/gguf-selector | GGUF selector | Experimental |
| 51 | calcuis/callgg | GGUF caller | Experimental |
| 52 | pecharesjoselito/chuck.optimizer | Optimize neural network training by monitoring loss, gradients, and... | Experimental |
| 53 | arcxteam/gguf-convert-model | Auto GGUF Converter for HuggingFace Hub Models with Multiple Quantizations... | Experimental |
| 54 | Keyvanhardani/kvcache-autotune | Automatic KV-Cache optimization for HuggingFace Transformers. Find the... | Experimental |
| 55 | pszemraj/decoder-pytorch-template | Hackable PyTorch template for decoder-only transformer architecture... | Experimental |
| 56 | SolomonB14D3/intelligent-svd | Knowledge-preserving SVD compression for large language models via... | Experimental |
| 57 | Kalmantic/peakweights | Data-free discovery of critical LLM weights. One forward pass. No... | Experimental |
| 58 | bkataru/hf-hub-zig | Zig library and CLI for interacting with the HuggingFace Hub API, with a... | Experimental |
| 59 | Zoclee/xojo-llama | A wrapper module to do local LLM inference on GGUF models using the... | Experimental |
| 60 | jaepil/geometric-adam | A Ray Tracing-Inspired Approach to Neural Network Optimization | Experimental |
| 61 | ambv231/tinyllama-coreml-ios18-quantization | Quantize TinyLlama-1.1B-Chat from PyTorch to CoreML (float16, int8, int4)... | Experimental |
| 62 | LiteObject/llm-quantization-playground | A hands-on demo project that compares multiple quantization methods for... | Experimental |
| 63 | zzbright1998/SentenceKV | Official implementation of "SentenceKV: Efficient LLM Inference via... | Experimental |
| 64 | lciric/gptq-from-scratch | GPTQ post-training quantization from scratch — GPT-2, OPT, LLaMA support | Experimental |
| 65 | megvii-research/IntLLaMA | IntLLaMA: A fast and light quantization solution for LLaMA | Experimental |
| 66 | 1337hero/rx7900xtx-llama-bench-vulcan | Benchmark script for llama.cpp & results for AMD RX 7900 XTX - using Vulcan | Experimental |
| 67 | GodreignElgin/llm-comparision | Jupyter Notebook for LLM compression via quantization (INT8, INT4, FP16) and... | Experimental |
| 68 | MohammadKaso/tiny_Llama_mcp_flutter | edge_flutter enables seamless on-device Large Language Model inference using... | Experimental |
| 69 | j341nono/LLMGusser | CLI guessing game to identify which LLM (Llama vs Gemma) generated text,... | Experimental |
| 70 | LMLK-seal/ModelQuants | Professional Model Quantization Converter for HuggingFace Transformers | Experimental |
| 71 | trifledmatter/model-engine | C++ Implementation of Meta's LLaMA v2 Engine. Credited to ggerganov/llama.cpp | Experimental |
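For readers new to the area, the operation underlying the weight-quantization toolkits above is per-group round-to-nearest quantization: each small group of float weights is replaced by low-bit integer codes plus one shared scale. The sketch below is a deliberately simplified pure-Python illustration of that idea, not any specific library's algorithm.

```python
def quantize_group(weights, bits=4):
    """Symmetric round-to-nearest quantization of one weight group.

    Returns (codes, scale) such that codes[i] * scale approximates weights[i].
    """
    qmax = 2 ** (bits - 1) - 1  # 7 for signed int4
    # One shared scale per group, chosen so the largest weight maps to qmax.
    scale = max(abs(w) for w in weights) / qmax or 1.0
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


group = [0.12, -0.40, 0.33, 0.05]
codes, scale = quantize_group(group)
dequant = [c * scale for c in codes]
# Round-trip error per weight is bounded by scale / 2.
```

Real toolkits refine this basic recipe rather than replace it: GPTQ updates the not-yet-quantized weights after each rounding step to compensate for the error, and AWQ rescales channels using activation statistics before rounding, but both still store low-bit codes plus per-group scales as sketched above.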