LLM CUDA Optimization Tools
Low-level CUDA kernel development, GPU memory optimization, and hardware-accelerated inference engines for LLMs. Includes custom GEMM implementations, tensor operations, quantization kernels, and distributed inference backends. Does NOT include high-level inference frameworks, application layers, or non-GPU acceleration methods.
There are 27 LLM CUDA optimization tools tracked. One scores above 50 (the established tier). The highest-rated is ggml-org/ggml at 68/100 with 14,217 stars. One of the top 10 is actively maintained.
Get all 27 projects as JSON:
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=llm-cuda-optimization&limit=27"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
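The same endpoint can be queried from a script. A minimal sketch in Python using only the standard library; the `api_key` query parameter is an assumption for illustration (the real API may expect a header instead), and the JSON response shape is not documented here, so the example only builds the request URL:

```python
from urllib.parse import urlencode

BASE = "https://pt-edge.onrender.com/api/v1/datasets/quality"

def quality_url(domain, subcategory, limit=27, api_key=None):
    # Build the query string for the quality endpoint.
    # NOTE: `api_key` is a hypothetical parameter name -- check the
    # API docs for the actual authentication mechanism.
    params = {"domain": domain, "subcategory": subcategory, "limit": limit}
    if api_key:
        params["api_key"] = api_key
    return f"{BASE}?{urlencode(params)}"

url = quality_url("llm-tools", "llm-cuda-optimization")
print(url)
```

From there, fetching with `urllib.request.urlopen(url)` and parsing via `json.load` should work for any JSON response, but inspect the payload before assuming specific field names.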
| # | Tool | Description | Score | Tier |
|---|---|---|---|---|
| 1 | ggml-org/ggml | Tensor library for machine learning | 68 | Established |
| 2 | onnx/ir-py | Efficient in-memory representation for ONNX, in Python | | Emerging |
| 3 | bytedance/lightseq | LightSeq: A High Performance Library for Sequence Processing and Generation | | Emerging |
| 4 | R-D-BioTech-Alaska/Qelm | Qelm - Quantum Enhanced Language Model | | Emerging |
| 5 | SandAI-org/MagiCompiler | A plug-and-play compiler that delivers free-lunch optimizations for both... | | Emerging |
| 6 | kekzl/imp | High-performance LLM inference engine in C++/CUDA for NVIDIA Blackwell GPUs... | | Emerging |
| 7 | dongchany/ember | A lightweight multi-GPU inference engine for LLMs on mid/low-end GPUs. | | Emerging |
| 8 | llcuda/llcuda | CUDA 12-first backend inference for Unsloth on Kaggle — Optimized for small... | | Emerging |
| 9 | jjang-ai/jangq | JANG — GGUF for MLX. YOU MUST USE JANG_Q RUNTIME. Adaptive Mixed-Precision... | | Emerging |
| 10 | artalis-io/bitnet.c | Minimal, embeddable LLM inference engine in pure C11. 20+ GGUF quant... | | Experimental |
| 11 | rockyco/OpenWLAN | AI-powered MATLAB-to-HLS framework for WLAN 802.11 synchronization. 3.88x... | | Experimental |
| 12 | liashchynskyi/ggufer | Convert & quantize HuggingFace models using llama.cpp on premises | | Experimental |
| 13 | rockyco/peakPicker | A Comprehensive Comparative Study of LLM-Aided FPGA Design Flow | | Experimental |
| 14 | friedpotato04/CUDA-L2 | 🚀 Optimize Half-precision General Matrix Multiply (HGEMM) CUDA kernels using... | | Experimental |
| 15 | luckystar-pear/llm-compress | Compress context data to optimize memory and performance in C++ large... | | Experimental |
| 16 | mtmatheuus/QKV-Core | 🚀 Run modern 7B LLMs on legacy 4GB GPUs without crashes, breaking the VRAM... | | Experimental |
| 17 | Zzzxkxz/cuda-fp8-ampere | 🚀 Accelerate FP8 GEMM tasks on RTX 3090 Ti using lightweight storage and... | | Experimental |
| 18 | saleembarakat4/viva_tensor | 🚀 Accelerate your computations with viva_tensor, the fastest tensor library... | | Experimental |
| 19 | LessUp/tiny-llm | Lightweight LLM Inference Engine (CUDA C++17): W8A16 Quantization, KV Cache... | | Experimental |
| 20 | rockyco/ImageProcessing | LLM-Aided FPGA Design Optimization | | Experimental |
| 21 | amai-gsu/LM-Meter | Official code repo of SEC'25 paper: lm-Meter: Unveiling Runtime Inference... | | Experimental |
| 22 | moham94/mini-sglang | 🚀 Harness mini-SGLang to power efficient inference for Large Language Models... | | Experimental |
| 23 | ProCoder1199X/NanoAccel | Python Library for inference of LLMs on low end hardware and CPU optimizations | | Experimental |
| 24 | Wasisange/cuda-kernels-collection | Custom CUDA kernels for optimized tensor operations in deep learning. | | Experimental |
| 25 | 0xnu/qrme | qrme is a quantum-resistant encrypted machine learning system designed to... | | Experimental |
| 26 | deependujha/DeepTensor | DeepTensor: A minimal PyTorch-like deep learning library focused on custom... | | Experimental |
| 27 | K-Wu/intrasm_engine | Enhancing CUDA Intra-Streaming-Multiprocessor Parallelism for Large Language... | | Experimental |