LLM Quantization Methods for Transformer Models

Tools and implementations for quantizing large language models using techniques like GPTQ, AWQ, and KV cache compression to reduce model size and inference costs. Does NOT include general model compression via pruning, distillation, or training optimization.
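As a toy illustration of the core idea these toolkits build on, here is symmetric per-tensor int8 round-to-nearest quantization in NumPy. This is a minimal sketch, not the API of any listed project: methods like GPTQ and AWQ add error correction and activation-aware scaling on top of this baseline, and the function names below are illustrative only.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: w ~= scale * q."""
    scale = float(np.abs(w).max()) / 127.0
    scale = max(scale, 1e-12)  # guard against an all-zero tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float tensor from the int8 codes."""
    return q.astype(np.float32) * scale

# A 4x4 float32 weight tensor shrinks from 4 bytes to 1 byte per element.
w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
err = float(np.abs(w - dequantize(q, s)).max())
```

With round-to-nearest and a scale of `max(|w|)/127`, the worst-case per-element reconstruction error is half a quantization step (`scale / 2`), which is the accuracy floor the fancier methods in this list try to beat in end-to-end perplexity.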

This category tracks 71 LLM quantization projects. Three score above 70 (Verified tier). The highest-rated is intel/auto-round at 88/100, with 883 stars and 44,854 monthly downloads. Five of the top 10 are actively maintained.

Fetch the ranked projects as JSON (the example below returns the top 20; the `limit` parameter controls how many of the 71 entries are returned):

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=transformers&subcategory=llm-quantization-methods&limit=20"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
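If you are scripting against the endpoint rather than using curl, the query URL can be assembled like this. Only the `domain`, `subcategory`, and `limit` parameters appear on this page; any other parameter would be an assumption about the API.

```python
from urllib.parse import urlencode

BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"

def quality_url(domain: str, subcategory: str, limit: int = 20) -> str:
    """Build the same dataset-quality query URL as the curl example."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"

# Equivalent to the curl example, but asking for all 71 entries:
url = quality_url("transformers", "llm-quantization-methods", limit=71)
```

Fetch the resulting URL with `urllib.request.urlopen(url)` or any HTTP client; unauthenticated access is rate-limited to 100 requests/day as noted above.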

| # | Model | Description | Score | Tier |
|---|-------|-------------|-------|------|
| 1 | intel/auto-round | 🎯 An accuracy-first, highly efficient quantization toolkit for LLMs, designed... | 88 | Verified |
| 2 | ModelCloud/GPTQModel | LLM model quantization (compression) toolkit with hw acceleration support... | 86 | Verified |
| 3 | pytorch/ao | PyTorch native quantization and sparsity for training and inference | 74 | Verified |
| 4 | Picovoice/picollm | On-device LLM Inference Powered by X-Bit Quantization | 62 | Established |
| 5 | NVIDIA/kvpress | LLM KV cache compression made easy | 62 | Established |
| 6 | BlinkDL/RWKV-LM | RWKV (pronounced RwaKuv) is an RNN with great LLM performance, which can... | 57 | Established |
| 7 | bodaay/HuggingFaceModelDownloader | Simple go utility to download HuggingFace Models and Datasets | 56 | Established |
| 8 | ddh0/easy-llama | Python package wrapping llama.cpp for on-device LLM inference | 55 | Established |
| 9 | jy-yuan/KIVI | [ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache | 49 | Emerging |
| 10 | livingbio/fuzzy-json | Fuzzy-JSON is a compact Python package with no dependencies, designed to... | 48 | Emerging |
| 11 | back2matching/turboquant | First open-source TurboQuant KV cache compression for LLM inference. Drop-in... | 47 | Emerging |
| 12 | AutoGPTQ/AutoGPTQ | An easy-to-use LLMs quantization package with user-friendly apis, based on... | 46 | Emerging |
| 13 | laelhalawani/gguf_modeldb | A quick and optimized solution to manage llama based gguf quantized models,... | 43 | Emerging |
| 14 | calcuis/gguf-core | a simple way to interact llama with gguf | 42 | Emerging |
| 15 | TencentARC/LLaMA-Pro | [ACL 2024] Progressive LLaMA with Block Expansion. | 41 | Emerging |
| 16 | zjysteven/mink-plus-plus | [ICLR'25 Spotlight] Min-K%++: Improved baseline for detecting pre-training... | 41 | Emerging |
| 17 | SqueezeAILab/SqueezeLLM | [ICML 2024] SqueezeLLM: Dense-and-Sparse Quantization | 41 | Emerging |
| 18 | zackshen/gguf | a GGUF file parser | 40 | Emerging |
| 19 | GAIR-NLP/ProX | [ICML 2025] Programming Every Example: Lifting Pre-training Data Quality... | 40 | Emerging |
| 20 | Michael-A-Kuykendall/shimmytok | Pure Rust tokenizer for GGUF models - llama.cpp compatible | 39 | Emerging |
| 21 | ariannamethod/doe | DoE Janus Architecture: Democracy of Experts | 39 | Emerging |
| 22 | SqueezeAILab/LLM2LLM | [ACL 2024] LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement | 38 | Emerging |
| 23 | NVlabs/RocketKV | [ICML 2025] RocketKV: Accelerating Long-Context LLM Inference via Two-Stage... | 36 | Emerging |
| 24 | AaronFeng753/Ollama-Model-Dumper | Export and Backup Ollama models into GGUF and ModelFile | 36 | Emerging |
| 25 | awneesht/KVShuttle | Benchmark & decision framework for KV cache transfer compression in... | 36 | Emerging |
| 26 | gitctrlx/llama.cu | Llama from scratch in CUDA with Flash Attention. | 34 | Emerging |
| 27 | StargazerX0/ScaleKV | [NeurIPS 2025] ScaleKV: Memory-Efficient Visual Autoregressive Modeling with... | 34 | Emerging |
| 28 | ModelTC/QLLM | [ICLR 2024] This is the official PyTorch implementation of "QLLM: Accurate... | 34 | Emerging |
| 29 | Beomi/BitNet-Transformers | 0️⃣1️⃣🤗 BitNet-Transformers: Huggingface Transformers Implementation of... | 34 | Emerging |
| 30 | SqueezeAILab/KVQuant | [NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with... | 34 | Emerging |
| 31 | monk1337/auto-ollama | run ollama & gguf easily with a single command | 34 | Emerging |
| 32 | laelhalawani/gguf_llama | Wrapper for simplified use of Llama2 GGUF quantized models. | 32 | Emerging |
| 33 | smpanaro/coreml-llm-cli | CLI to demonstrate running a large language model (LLM) on Apple Neural Engine. | 31 | Emerging |
| 34 | Rishit-dagli/GLU | An easy-to-use library for GLU (Gated Linear Units) and GLU variants in TensorFlow. | 30 | Emerging |
| 35 | gpustack/gguf-packer-go | Deliver LLMs of GGUF format via Dockerfile. | 30 | Emerging |
| 36 | LMLK-seal/HuggingGGUF | Hugging Face Model downloader and GGUF Converter. | 30 | Emerging |
| 37 | camenduru/alpaca-lora-colab | Alpaca Lora | 29 | Experimental |
| 38 | Zishan-Shao/FlashSVD | Welcome to the FlashSVD, an activation aware inference system for SVD-based... | 28 | Experimental |
| 39 | leliuga/cohere-configurations | Co:Here Inference configurations | 27 | Experimental |
| 40 | elephantmipt/compressors | A small library with distillation, quantization and pruning pipelines | 26 | Experimental |
| 41 | laelhalawani/glai | glai - GGUF LLAMA AI - Package for simplified model handling and text... | 26 | Experimental |
| 42 | eliahuhorwitz/MoTHer | Official PyTorch Implementation for the "Unsupervised Model Tree Heritage... | 26 | Experimental |
| 43 | codewithdark-git/QuantLLM | QuantLLM is a Python library designed for developers, researchers, and teams... | 26 | Experimental |
| 44 | lpalbou/model-quantizer | Effortlessly quantize, benchmark, and publish Hugging Face models with... | 25 | Experimental |
| 45 | calcuis/llama-core | solo connector core built on llama.cpp | 24 | Experimental |
| 46 | kyegomez/open_qwen | A non-official implementation of Qwen 3.5, as there doesn’t seem to be a... | 23 | Experimental |
| 47 | Evrmind-UK/evr-llama | Runtime binaries for Evrmind EVR-1 models | 23 | Experimental |
| 48 | petermartens98/Qwen3-LLM-Pytorch-Implementation-From-Scratch | Lightweight LLM inspired by Qwen3, built from scratch in PyTorch. Full... | 22 | Experimental |
| 49 | boyazzam/kvcache-autotune | 🚀 Optimize your KVCache performance with automatic tuning for efficient... | 22 | Experimental |
| 50 | calcuis/gguf-selector | GGUF selector | 22 | Experimental |
| 51 | calcuis/callgg | GGUF caller | 22 | Experimental |
| 52 | pecharesjoselito/chuck.optimizer | Optimize neural network training by monitoring loss, gradients, and... | 22 | Experimental |
| 53 | arcxteam/gguf-convert-model | Auto GGUF Converter for HuggingFace Hub Models with Multiple Quantizations... | 21 | Experimental |
| 54 | Keyvanhardani/kvcache-autotune | Automatic KV-Cache optimization for HuggingFace Transformers. Find the... | 20 | Experimental |
| 55 | pszemraj/decoder-pytorch-template | Hackable PyTorch template for decoder-only transformer architecture... | 20 | Experimental |
| 56 | SolomonB14D3/intelligent-svd | Knowledge-preserving SVD compression for large language models via... | 20 | Experimental |
| 57 | Kalmantic/peakweights | Data-free discovery of critical LLM weights. One forward pass. No... | 19 | Experimental |
| 58 | bkataru/hf-hub-zig | Zig library and CLI for interacting with the HuggingFace Hub API, with a... | 19 | Experimental |
| 59 | Zoclee/xojo-llama | A wrapper module to do local LLM inference on GGUF models using the... | 17 | Experimental |
| 60 | jaepil/geometric-adam | A Ray Tracing-Inspired Approach to Neural Network Optimization | 17 | Experimental |
| 61 | ambv231/tinyllama-coreml-ios18-quantization | Quantize TinyLlama-1.1B-Chat from PyTorch to CoreML (float16, int8, int4)... | 16 | Experimental |
| 62 | LiteObject/llm-quantization-playground | A hands-on demo project that compares multiple quantization methods for... | 15 | Experimental |
| 63 | zzbright1998/SentenceKV | Official implementation of "SentenceKV: Efficient LLM Inference via... | 15 | Experimental |
| 64 | lciric/gptq-from-scratch | GPTQ post-training quantization from scratch (GPT-2, OPT, LLaMA support) | 15 | Experimental |
| 65 | megvii-research/IntLLaMA | IntLLaMA: A fast and light quantization solution for LLaMA | 15 | Experimental |
| 66 | 1337hero/rx7900xtx-llama-bench-vulcan | Benchmark script for llama.cpp & results for AMD RX 7900 XTX - using Vulcan | 15 | Experimental |
| 67 | GodreignElgin/llm-comparision | Jupyter Notebook for LLM compression via quantization (INT8, INT4, FP16) and... | 13 | Experimental |
| 68 | MohammadKaso/tiny_Llama_mcp_flutter | edge_flutter enables seamless on-device Large Language Model inference using... | 11 | Experimental |
| 69 | j341nono/LLMGusser | CLI guessing game to identify which LLM (Llama vs Gemma) generated text,... | 11 | Experimental |
| 70 | LMLK-seal/ModelQuants | Professional Model Quantization Converter for HuggingFace Transformers | 11 | Experimental |
| 71 | trifledmatter/model-engine | C++ Implementation of Meta's LLaMA v2 Engine. Credited to ggerganov/llama.cpp | 11 | Experimental |

Comparisons in this category