GPU Parallel Programming ML Frameworks
Tutorials, guides, and implementations for GPU computing using CUDA and related parallel processing frameworks. Focuses on learning CUDA fundamentals, optimization techniques, and GPU-accelerated computing. Does NOT include ML applications built with GPUs, collective communication libraries, or physics simulations—only the programming language/platform itself.
60 GPU parallel programming frameworks are tracked. Three score above 70 (Verified tier). The highest-rated is iree-org/iree at 76/100 with 3,655 stars. Six of the top 10 are actively maintained.
Get the projects as JSON (the request below sets `limit=20`)
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=ml-frameworks&subcategory=gpu-parallel-programming&limit=20"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
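The same request can be made from a script. The sketch below is a minimal Python equivalent of the curl call above, using only the standard library. The helper names `build_url` and `fetch_projects` are hypothetical, the response schema is not documented on this page so the decoded JSON is returned unchanged, and whether the endpoint accepts a `limit` larger than 20 is an assumption to verify.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint copied from the curl example; everything else here is illustrative.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="ml-frameworks",
              subcategory="gpu-parallel-programming",
              limit=20):
    """Build the query URL with the same parameters as the curl example."""
    query = urlencode({
        "domain": domain,
        "subcategory": subcategory,
        "limit": limit,
    })
    return f"{BASE_URL}?{query}"


def fetch_projects(url, timeout=30):
    """Send a GET request and decode the JSON body.

    The payload structure is not described in this document, so the
    decoded value is returned as-is for the caller to inspect.
    """
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_projects(build_url())` issues one request, which counts against the 100 requests/day keyless allowance, so it is worth caching the result locally while exploring the data.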
| # | Framework | Score | Tier |
|---|---|---|---|
| 1 | iree-org/iree: A retargetable MLIR-based machine learning compiler and runtime toolkit. | 76 | Verified |
| 2 | rapidsai/cuml: cuML - RAPIDS Machine Learning Library | | Verified |
| 3 | NVIDIA/cutlass: CUDA Templates and Python DSLs for High-Performance Linear Algebra | | Verified |
| 4 | brucefan1983/GPUMD: Graphics Processing Units Molecular Dynamics | | Established |
| 5 | NVIDIA/nccl: Optimized primitives for collective multi-GPU communication | | Established |
| 6 | uxlfoundation/oneDAL: oneAPI Data Analytics Library (oneDAL) | | Established |
| 7 | ROCm/Tensile: [DEPRECATED] Moved to ROCm/rocm-libraries repo | | Established |
| 8 | openucx/ucc: Unified Collective Communication Library | | Established |
| 9 | ROCm/hipBLASLt: [DEPRECATED] Moved to ROCm/rocm-libraries repo | | Established |
| 10 | libxsmm/libxsmm: Library for specialized dense and sparse matrix operations, and deep... | | Established |
| 11 | uxlfoundation/oneCCL: oneAPI Collective Communications Library (oneCCL) | | Established |
| 12 | XiaoMi/mace: MACE is a deep learning inference framework optimized for mobile... | | Emerging |
| 13 | PaddleJitLab/CUDATutorial: A self-learning tutorial for CUDA High Performance Programming. | | Emerging |
| 14 | google/gematria: Machine learning for machine code. | | Emerging |
| 15 | srush/GPU-Puzzles: Solve puzzles. Learn CUDA. | | Emerging |
| 16 | mratsim/Arraymancer: A fast, ergonomic and portable tensor library in Nim with a deep learning... | | Emerging |
| 17 | Edgecortix-Inc/mera: A Heterogeneous Platform Deep Learning Compiler Framework from EdgeCortix | | Emerging |
| 18 | NVIDIA/GMAT: A toolkit showing GPU's all-round capability in video processing | | Emerging |
| 19 | cuMF/cumf_als: CUDA Matrix Factorization Library with Alternating Least Square (ALS) | | Emerging |
| 20 | hshatti/Tensorium: A platform agnostic fast tensor manipulation library using SIMD when... | | Emerging |
| 21 | gorgonia/tensor: package tensor provides efficient and generic n-dimensional arrays in Go... | | Emerging |
| 22 | OutofAi/cudacanvas: Python Module for PyTorch Tensor Visualisation in CUDA Eliminating CPU Transfer | | Emerging |
| 23 | MegEngine/MegCC: MegCC is a deep learning model compiler with an ultra-lightweight runtime that is efficient and easy to port | | Emerging |
| 24 | mc2-project/mc2: A Platform for Secure Analytics and Machine Learning | | Emerging |
| 25 | MetaMachines/mm-ptx: PTX Inject and Stack PTX | | Emerging |
| 26 | Frikallo/axiom: High-performance C++ tensor library with NumPy/PyTorch-like API | | Emerging |
| 27 | bytedance/matxscript: A high-performance, extensible Python AOT compiler. | | Emerging |
| 28 | OAID/TensorFlow-HRT: Heterogeneous Run Time version of TensorFlow. Added heterogeneous... | | Emerging |
| 29 | AndreSlavescu/meTile: python-based eDSL for efficient Metal Shading Language code generation | | Emerging |
| 30 | google/nccl-fastsocket: NCCL Fast Socket is a transport layer plugin to improve NCCL collective... | | Emerging |
| 31 | eedalong/ECE408: Code base and slides for ECE408: Applied Parallel Programming On GPU. | | Emerging |
| 32 | mratsim/laser: The HPC toolbox: fused matrix multiplication, convolution, data-parallel... | | Emerging |
| 33 | HenryNdubuaku/cuda-tutorials: Comprehensive CUDA tutorials for Maths & ML with examples. | | Emerging |
| 34 | wangsiping97/FastGEMV: High-speed GEMV kernels, at most 2.7x speedup compared to pytorch baseline. | | Experimental |
| 35 | AXERA-TECH/ax-npu-kit-650: AI algorithm SDK based on AX650 | | Experimental |
| 36 | openmlsys/openmlsys-cuda: Tutorials for writing high-performance GPU operators in AI frameworks. | | Experimental |
| 37 | lawmurray/gpu-course: Deep neural network and Adam optimizer in straight C and CUDA. Accompanies... | | Experimental |
| 38 | Venkat2811/yali: Speed-of-Light SW efficiency by using ultra low-latency primitives for comms... | | Experimental |
| 39 | lcmialichi/php-cuda-ext: Direct NVIDIA CUDA access for PHP. GPU-accelerated tensors, JIT-compiled... | | Experimental |
| 40 | Abrahamduru/mHC.cu: 🚀 Implement mHC using CUDA for efficient Manifold-Constrained... | | Experimental |
| 41 | mikeroyal/OpenCL-Guide: OpenCL Guide | | Experimental |
| 42 | realies/microgpt.c: Karpathy's microgpt.py, in C | | Experimental |
| 43 | mrpottermusic/nccl-mesh-plugin: 🌐 Enable distributed ML with the NCCL Mesh Plugin for efficient... | | Experimental |
| 44 | SamerMakni/cuda-selector: A simple tool to select the optimal CUDA device based on custom criteria. | | Experimental |
| 45 | porosh656/cuPDLPx: 🚀 Accelerate your linear programming with cuPDLPx, a GPU-based solver that... | | Experimental |
| 46 | LessUp/hpc-ai-optimization-lab: CUDA HPC Kernel Optimization Textbook: Naive to Tensor Core — GEMM,... | | Experimental |
| 47 | LessUp/cuda-kernel-academy: CUDA Kernel Optimization Academy: SGEMM Tutorial, TensorCraft Ops, HPC... | | Experimental |
| 48 | muhamadsafii-21/cutile-learn: 🚀 Learn efficient CUDA programming with cuTile through hands-on tutorials... | | Experimental |
| 49 | priteshgohil/CUDA-programming-tutorial: Get started with CUDA programming | | Experimental |
| 50 | NumPower/numpower-autograd: High performance PHP tensor with autograd (automatic differentiation) and... | | Experimental |
| 51 | gabrielmaialva33/viva_tensor: Pure Gleam tensor library with quantization (INT8, NF4, AWQ), Flash... | | Experimental |
| 52 | nageshnnazare/cuda-know-hows: cuda related stuff | | Experimental |
| 53 | karton3c/kuda: my custom open-source programming language | | Experimental |
| 54 | camarababa/cuda-mastery-guide: 🚀 Master CUDA programming with structured lessons covering fundamentals,... | | Experimental |
| 55 | Pects1949/Cpp-Distributed-ML-Framework: A C++ framework for distributed machine learning training, focusing on... | | Experimental |
| 56 | aksayush2005/project-compiled: A Mini Machine Learning Compiler with Hardware-Aware Optimization | | Experimental |
| 57 | Duconnor/Pudding: This is the official repository for the project Pudding. Pudding enables you... | | Experimental |
| 58 | rurumimic/cuda: compute unified device architecture | | Experimental |
| 59 | garrettkinman/SteadyTensor: An ultra-light, ultra-flexible tensor library written in pure Nim. Intended... | | Experimental |
| 60 | ProjectoOfficial/CUDA: Learn cuda step-by-step starting from 0 with these simple and free code... | | Experimental |