LLM CUDA Optimization (LLM Tools)

Low-level CUDA kernel development, GPU memory optimization, and hardware-accelerated inference engines for LLMs. Includes custom GEMM implementations, tensor operations, quantization kernels, and distributed inference backends. Does NOT include high-level inference frameworks, application layers, or non-GPU acceleration methods.

There are 27 LLM CUDA optimization tools tracked. One scores above 50 (Established tier). The highest-rated is ggml-org/ggml at 68/100 with 14,217 stars. One of the top 10 is actively maintained.

Get all 27 projects as JSON:

```shell
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=llm-cuda-optimization&limit=20"
```

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
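The JSON returned by the endpoint above can be filtered locally, for example to pull out only Established-tier projects. A minimal sketch in Python, run against a hard-coded sample rather than a live request; the field names (`projects`, `name`, `score`, `tier`) are assumptions about the response shape, not confirmed by any API documentation:

```python
import json

# Hypothetical sample mimicking the API response shape.
# Field names are assumptions, not taken from the actual API.
sample = json.dumps({
    "projects": [
        {"name": "ggml-org/ggml", "score": 68, "tier": "Established"},
        {"name": "onnx/ir-py", "score": 49, "tier": "Emerging"},
        {"name": "bytedance/lightseq", "score": 46, "tier": "Emerging"},
    ]
})

data = json.loads(sample)

# Keep only projects scoring above the Established-tier cutoff of 50.
established = [p["name"] for p in data["projects"] if p["score"] > 50]
print(established)  # ['ggml-org/ggml']
```

For a live call, replace `sample` with the body returned by the curl command above (e.g. via `urllib.request.urlopen`), keeping in mind the 100 requests/day limit for keyless access.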

| # | Tool | Description | Score | Tier |
|---|------|-------------|-------|------|
| 1 | ggml-org/ggml | Tensor library for machine learning | 68 | Established |
| 2 | onnx/ir-py | Efficient in-memory representation for ONNX, in Python | 49 | Emerging |
| 3 | bytedance/lightseq | LightSeq: A High Performance Library for Sequence Processing and Generation | 46 | Emerging |
| 4 | R-D-BioTech-Alaska/Qelm | Qelm - Quantum Enhanced Language Model | 45 | Emerging |
| 5 | SandAI-org/MagiCompiler | A plug-and-play compiler that delivers free-lunch optimizations for both... | 45 | Emerging |
| 6 | kekzl/imp | High-performance LLM inference engine in C++/CUDA for NVIDIA Blackwell GPUs... | 38 | Emerging |
| 7 | dongchany/ember | A lightweight multi-GPU inference engine for LLMs on mid/low-end GPUs. | 36 | Emerging |
| 8 | llcuda/llcuda | CUDA 12-first backend inference for Unsloth on Kaggle — Optimized for small... | 31 | Emerging |
| 9 | jjang-ai/jangq | JANG — GGUF for MLX. YOU MUST USE JANG_Q RUNTIME. Adaptive Mixed-Precision... | 30 | Emerging |
| 10 | artalis-io/bitnet.c | Minimal, embeddable LLM inference engine in pure C11. 20+ GGUF quant... | 26 | Experimental |
| 11 | rockyco/OpenWLAN | AI-powered MATLAB-to-HLS framework for WLAN 802.11 synchronization. 3.88x... | 24 | Experimental |
| 12 | liashchynskyi/ggufer | Convert & quantize HuggingFace models using llama.cpp on premises | 23 | Experimental |
| 13 | rockyco/peakPicker | A Comprehensive Comparative Study of LLM-Aided FPGA Design Flow | 22 | Experimental |
| 14 | friedpotato04/CUDA-L2 | 🚀 Optimize Half-precision General Matrix Multiply (HGEMM) CUDA kernels using... | 22 | Experimental |
| 15 | luckystar-pear/llm-compress | Compress context data to optimize memory and performance in C++ large... | 22 | Experimental |
| 16 | mtmatheuus/QKV-Core | 🚀 Run modern 7B LLMs on legacy 4GB GPUs without crashes, breaking the VRAM... | 22 | Experimental |
| 17 | Zzzxkxz/cuda-fp8-ampere | 🚀 Accelerate FP8 GEMM tasks on RTX 3090 Ti using lightweight storage and... | 22 | Experimental |
| 18 | saleembarakat4/viva_tensor | 🚀 Accelerate your computations with viva_tensor, the fastest tensor library... | 22 | Experimental |
| 19 | LessUp/tiny-llm | Lightweight LLM Inference Engine (CUDA C++17): W8A16 Quantization, KV Cache... | 22 | Experimental |
| 20 | rockyco/ImageProcessing | LLM-Aided FPGA Design Optimization | 18 | Experimental |
| 21 | amai-gsu/LM-Meter | Official code repo of SEC'25 paper: lm-Meter: Unveiling Runtime Inference... | 16 | Experimental |
| 22 | moham94/mini-sglang | 🚀 Harness mini-SGLang to power efficient inference for Large Language Models... | 15 | Experimental |
| 23 | ProCoder1199X/NanoAccel | Python Library for inference of LLMs on low end hardware and CPU optimizations | 15 | Experimental |
| 24 | Wasisange/cuda-kernels-collection | Custom CUDA kernels for optimized tensor operations in deep learning. | 14 | Experimental |
| 25 | 0xnu/qrme | qrme is a quantum-resistant encrypted machine learning system designed to... | 12 | Experimental |
| 26 | deependujha/DeepTensor | DeepTensor: A minimal PyTorch-like deep learning library focused on custom... | 12 | Experimental |
| 27 | K-Wu/intrasm_engine | Enhancing CUDA Intra-Streaming-Multiprocessor Parallelism for Large Language... | 11 | Experimental |