LLM CUDA Optimization (LLM Tools)

Low-level CUDA kernel development, GPU memory optimization, and hardware-accelerated inference engines for LLMs. Includes custom GEMM implementations, tensor operations, quantization kernels, and distributed inference backends. Does NOT include high-level inference frameworks, application layers, or non-GPU acceleration methods.

There are 27 LLM CUDA optimization tools tracked. One scores above 50 (Established tier). The highest-rated is ggml-org/ggml at 68/100 with 14,217 stars. One of the top 10 is actively maintained.

Get all 27 projects as JSON:

```shell
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=llm-cuda-optimization&limit=20"
```

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
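The JSON returned by the endpoint above can be filtered locally, for example to pull out only Established-tier projects. A minimal sketch in Python, run against a hard-coded sample rather than a live request; the field names (`projects`, `name`, `score`, `tier`) are assumptions about the response shape, not confirmed by any API documentation:

```python
import json

# Hypothetical sample mimicking the API response shape.
# Field names are assumptions, not taken from the actual API.
sample = json.dumps({
    "projects": [
        {"name": "ggml-org/ggml", "score": 68, "tier": "Established"},
        {"name": "onnx/ir-py", "score": 49, "tier": "Emerging"},
        {"name": "bytedance/lightseq", "score": 46, "tier": "Emerging"},
    ]
})

data = json.loads(sample)

# Keep only projects scoring above the Established-tier cutoff of 50.
established = [p["name"] for p in data["projects"] if p["score"] > 50]
print(established)  # ['ggml-org/ggml']
```

For a live call, replace `sample` with the body returned by the curl command above (e.g. via `urllib.request.urlopen`), keeping in mind the 100 requests/day limit for keyless access.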

| # | Tool | Description | Score | Tier |
|---|------|-------------|-------|------|
| 1 | ggml-org/ggml | Tensor library for machine learning | 68 | Established |
| 2 | onnx/ir-py | Efficient in-memory representation for ONNX, in Python | 49 | Emerging |
| 3 | bytedance/lightseq | LightSeq: A High Performance Library for Sequence Processing and Generation | 46 | Emerging |
| 4 | R-D-BioTech-Alaska/Qelm | Qelm - Quantum Enhanced Language Model | 45 | Emerging |
| 5 | SandAI-org/MagiCompiler | A plug-and-play compiler that delivers free-lunch optimizations for both... | 45 | Emerging |
| 6 | kekzl/imp | High-performance LLM inference engine in C++/CUDA for NVIDIA Blackwell GPUs... | 38 | Emerging |
| 7 | dongchany/ember | A lightweight multi-GPU inference engine for LLMs on mid/low-end GPUs. | 36 | Emerging |
| 8 | llcuda/llcuda | CUDA 12-first backend inference for Unsloth on Kaggle — Optimized for small... | 31 | Emerging |
| 9 | jjang-ai/jangq | JANG — GGUF for MLX. YOU MUST USE JANG_Q RUNTIME. Adaptive Mixed-Precision... | 30 | Emerging |
| 10 | artalis-io/bitnet.c | Minimal, embeddable LLM inference engine in pure C11. 20+ GGUF quant... | 26 | Experimental |
| 11 | rockyco/OpenWLAN | AI-powered MATLAB-to-HLS framework for WLAN 802.11 synchronization. 3.88x... | 24 | Experimental |
| 12 | liashchynskyi/ggufer | Convert & quantize HuggingFace models using llama.cpp on premises | 23 | Experimental |
| 13 | rockyco/peakPicker | A Comprehensive Comparative Study of LLM-Aided FPGA Design Flow | 22 | Experimental |
| 14 | friedpotato04/CUDA-L2 | 🚀 Optimize Half-precision General Matrix Multiply (HGEMM) CUDA kernels using... | 22 | Experimental |
| 15 | luckystar-pear/llm-compress | Compress context data to optimize memory and performance in C++ large... | 22 | Experimental |
| 16 | mtmatheuus/QKV-Core | 🚀 Run modern 7B LLMs on legacy 4GB GPUs without crashes, breaking the VRAM... | 22 | Experimental |
| 17 | Zzzxkxz/cuda-fp8-ampere | 🚀 Accelerate FP8 GEMM tasks on RTX 3090 Ti using lightweight storage and... | 22 | Experimental |
| 18 | saleembarakat4/viva_tensor | 🚀 Accelerate your computations with viva_tensor, the fastest tensor library... | 22 | Experimental |
| 19 | LessUp/tiny-llm | Lightweight LLM Inference Engine (CUDA C++17): W8A16 Quantization, KV Cache... | 22 | Experimental |
| 20 | rockyco/ImageProcessing | LLM-Aided FPGA Design Optimization | 18 | Experimental |
| 21 | amai-gsu/LM-Meter | Official code repo of SEC'25 paper: lm-Meter: Unveiling Runtime Inference... | 16 | Experimental |
| 22 | moham94/mini-sglang | 🚀 Harness mini-SGLang to power efficient inference for Large Language Models... | 15 | Experimental |
| 23 | ProCoder1199X/NanoAccel | Python Library for inference of LLMs on low end hardware and CPU optimizations | 15 | Experimental |
| 24 | Wasisange/cuda-kernels-collection | Custom CUDA kernels for optimized tensor operations in deep learning. | 14 | Experimental |
| 25 | 0xnu/qrme | qrme is a quantum-resistant encrypted machine learning system designed to... | 12 | Experimental |
| 26 | deependujha/DeepTensor | DeepTensor: A minimal PyTorch-like deep learning library focused on custom... | 12 | Experimental |
| 27 | K-Wu/intrasm_engine | Enhancing CUDA Intra-Streaming-Multiprocessor Parallelism for Large Language... | 11 | Experimental |