GPU Parallel Programming ML Frameworks

Tutorials, guides, and implementations for GPU computing using CUDA and related parallel processing frameworks. Focuses on learning CUDA fundamentals, optimization techniques, and GPU-accelerated computing. Does NOT include ML applications built with GPUs, collective communication libraries, or physics simulations—only the programming language/platform itself.

60 GPU parallel programming frameworks are tracked. 3 score 70 or above (Verified tier). The highest-rated is iree-org/iree at 76/100 with 3,655 stars. 6 of the top 10 are actively maintained.
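The tier labels in the listing appear to follow score bands. The sketch below maps a score to a tier using cutoffs inferred only from the rows on this page (70 is the lowest Verified score, 53 the lowest Established, 30 the lowest Emerging). The function name `tier_for` is hypothetical, and the Established cutoff of 50 is an assumption, since no project here scores between 50 and 52.

```python
def tier_for(score: int) -> str:
    """Map a 0-100 quality score to a tier label.

    Cutoffs are inferred from the rows listed on this page, not from
    any published scoring rule. The 50 boundary is an assumption:
    the page only shows 49 (Emerging) and 53 (Established).
    """
    if score >= 70:
        return "Verified"
    if score >= 50:  # assumed boundary, see docstring
        return "Established"
    if score >= 30:
        return "Emerging"
    return "Experimental"


# Scores taken from the listing: iree, cutlass, GPUMD, mace, FastGEMV.
for sample in (76, 70, 69, 49, 29):
    print(sample, tier_for(sample))
```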

Get all 60 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=ml-frameworks&subcategory=gpu-parallel-programming&limit=20"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
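For scripted access, the same request can be made from Python with only the standard library. This is a minimal sketch: the endpoint and query parameters are copied from the curl command above, while the helper names `build_url` and `fetch_projects` are hypothetical. The shape of the JSON response is not documented here, so the code returns the parsed payload as-is. The example URL uses `limit=20`; whether a larger value returns all 60 projects in one response is an assumption to verify against the API.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the curl example on this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="ml-frameworks",
              subcategory="gpu-parallel-programming",
              limit=20):
    """Build the dataset URL with the query parameters from the curl example."""
    query = urllib.parse.urlencode(
        {"domain": domain, "subcategory": subcategory, "limit": limit}
    )
    return f"{BASE_URL}?{query}"


def fetch_projects(url, timeout=30):
    """Fetch the URL and return the parsed JSON payload unchanged.

    The response schema is not described on this page, so no fields
    are assumed. Each call counts against the daily request limit.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (performs a network request, so it is left commented out):
# data = fetch_projects(build_url())
# print(json.dumps(data, indent=2)[:500])
```

Building the URL separately from the request keeps the parameter handling easy to check offline without touching the rate limit.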

| # | Framework | Description | Score | Tier |
|---|-----------|-------------|-------|------|
| 1 | iree-org/iree | A retargetable MLIR-based machine learning compiler and runtime toolkit. | 76 | Verified |
| 2 | rapidsai/cuml | cuML - RAPIDS Machine Learning Library | 72 | Verified |
| 3 | NVIDIA/cutlass | CUDA Templates and Python DSLs for High-Performance Linear Algebra | 70 | Verified |
| 4 | brucefan1983/GPUMD | Graphics Processing Units Molecular Dynamics | 69 | Established |
| 5 | NVIDIA/nccl | Optimized primitives for collective multi-GPU communication | 67 | Established |
| 6 | uxlfoundation/oneDAL | oneAPI Data Analytics Library (oneDAL) | 67 | Established |
| 7 | ROCm/Tensile | [DEPRECATED] Moved to ROCm/rocm-libraries repo | 57 | Established |
| 8 | openucx/ucc | Unified Collective Communication Library | 57 | Established |
| 9 | ROCm/hipBLASLt | [DEPRECATED] Moved to ROCm/rocm-libraries repo | 56 | Established |
| 10 | libxsmm/libxsmm | Library for specialized dense and sparse matrix operations, and deep... | 54 | Established |
| 11 | uxlfoundation/oneCCL | oneAPI Collective Communications Library (oneCCL) | 53 | Established |
| 12 | XiaoMi/mace | MACE is a deep learning inference framework optimized for mobile... | 49 | Emerging |
| 13 | PaddleJitLab/CUDATutorial | A self-learning tutorial for CUDA High Performance Programming. | 48 | Emerging |
| 14 | google/gematria | Machine learning for machine code. | 46 | Emerging |
| 15 | srush/GPU-Puzzles | Solve puzzles. Learn CUDA. | 45 | Emerging |
| 16 | mratsim/Arraymancer | A fast, ergonomic and portable tensor library in Nim with a deep learning... | 43 | Emerging |
| 17 | Edgecortix-Inc/mera | A Heterogeneous Platform Deep Learning Compiler Framework from EdgeCortix | 42 | Emerging |
| 18 | NVIDIA/GMAT | A toolkit showing GPU's all-round capability in video processing | 40 | Emerging |
| 19 | cuMF/cumf_als | CUDA Matrix Factorization Library with Alternating Least Square (ALS) | 40 | Emerging |
| 20 | hshatti/Tensorium | A platform agnostic fast tensor manipulation library using SIMD when... | 40 | Emerging |
| 21 | gorgonia/tensor | package tensor provides efficient and generic n-dimensional arrays in Go... | 39 | Emerging |
| 22 | OutofAi/cudacanvas | Python Module for PyTorch Tensor Visualisation in CUDA Eliminating CPU Transfer | 39 | Emerging |
| 23 | MegEngine/MegCC | MegCC is a deep learning model compiler with an ultra-lightweight runtime, high efficiency, and easy portability | 37 | Emerging |
| 24 | mc2-project/mc2 | A Platform for Secure Analytics and Machine Learning | 37 | Emerging |
| 25 | MetaMachines/mm-ptx | PTX Inject and Stack PTX | 36 | Emerging |
| 26 | Frikallo/axiom | High-performance C++ tensor library with NumPy/PyTorch-like API | 35 | Emerging |
| 27 | bytedance/matxscript | A high-performance, extensible Python AOT compiler. | 35 | Emerging |
| 28 | OAID/TensorFlow-HRT | Heterogeneous Run Time version of TensorFlow. Added heterogeneous... | 35 | Emerging |
| 29 | AndreSlavescu/meTile | Python-based eDSL for efficient Metal Shading Language code generation | 34 | Emerging |
| 30 | google/nccl-fastsocket | NCCL Fast Socket is a transport layer plugin to improve NCCL collective... | 33 | Emerging |
| 31 | eedalong/ECE408 | Code base and slides for ECE408: Applied Parallel Programming On GPU. | 32 | Emerging |
| 32 | mratsim/laser | The HPC toolbox: fused matrix multiplication, convolution, data-parallel... | 30 | Emerging |
| 33 | HenryNdubuaku/cuda-tutorials | Comprehensive CUDA tutorials for Maths & ML with examples. | 30 | Emerging |
| 34 | wangsiping97/FastGEMV | High-speed GEMV kernels, at most 2.7x speedup compared to pytorch baseline. | 29 | Experimental |
| 35 | AXERA-TECH/ax-npu-kit-650 | AI algorithm SDK based on AX650 | 27 | Experimental |
| 36 | openmlsys/openmlsys-cuda | Tutorials for writing high-performance GPU operators in AI frameworks. | 25 | Experimental |
| 37 | lawmurray/gpu-course | Deep neural network and Adam optimizer in straight C and CUDA. Accompanies... | 24 | Experimental |
| 38 | Venkat2811/yali | Speed-of-Light SW efficiency by using ultra low-latency primitives for comms... | 24 | Experimental |
| 39 | lcmialichi/php-cuda-ext | Direct NVIDIA CUDA access for PHP. GPU-accelerated tensors, JIT-compiled... | 24 | Experimental |
| 40 | Abrahamduru/mHC.cu | 🚀 Implement mHC using CUDA for efficient Manifold-Constrained... | 23 | Experimental |
| 41 | mikeroyal/OpenCL-Guide | OpenCL Guide | 23 | Experimental |
| 42 | realies/microgpt.c | Karpathy's microgpt.py, in C | 22 | Experimental |
| 43 | mrpottermusic/nccl-mesh-plugin | 🌐 Enable distributed ML with the NCCL Mesh Plugin for efficient... | 22 | Experimental |
| 44 | SamerMakni/cuda-selector | A simple tool to select the optimal CUDA device based on custom criteria. | 22 | Experimental |
| 45 | porosh656/cuPDLPx | 🚀 Accelerate your linear programming with cuPDLPx, a GPU-based solver that... | 22 | Experimental |
| 46 | LessUp/hpc-ai-optimization-lab | CUDA HPC Kernel Optimization Textbook: Naive to Tensor Core - GEMM,... | 22 | Experimental |
| 47 | LessUp/cuda-kernel-academy | CUDA Kernel Optimization Academy: SGEMM Tutorial, TensorCraft Ops, HPC... | 22 | Experimental |
| 48 | muhamadsafii-21/cutile-learn | 🚀 Learn efficient CUDA programming with cuTile through hands-on tutorials... | 22 | Experimental |
| 49 | priteshgohil/CUDA-programming-tutorial | Get started with CUDA programming | 22 | Experimental |
| 50 | NumPower/numpower-autograd | High performance PHP tensor with autograd (automatic differentiation) and... | 21 | Experimental |
| 51 | gabrielmaialva33/viva_tensor | Pure Gleam tensor library with quantization (INT8, NF4, AWQ), Flash... | 21 | Experimental |
| 52 | nageshnnazare/cuda-know-hows | CUDA-related stuff | 16 | Experimental |
| 53 | karton3c/kuda | My custom open-source programming language | 15 | Experimental |
| 54 | camarababa/cuda-mastery-guide | 🚀 Master CUDA programming with structured lessons covering fundamentals,... | 14 | Experimental |
| 55 | Pects1949/Cpp-Distributed-ML-Framework | A C++ framework for distributed machine learning training, focusing on... | 14 | Experimental |
| 56 | aksayush2005/project-compiled | A Mini Machine Learning Compiler with Hardware-Aware Optimization | 14 | Experimental |
| 57 | Duconnor/Pudding | This is the official repository for the project Pudding. Pudding enables you... | 12 | Experimental |
| 58 | rurumimic/cuda | Compute Unified Device Architecture | 11 | Experimental |
| 59 | garrettkinman/SteadyTensor | An ultra-light, ultra-flexible tensor library written in pure Nim. Intended... | 11 | Experimental |
| 60 | ProjectoOfficial/CUDA | Learn CUDA step-by-step starting from 0 with these simple and free code... | 11 | Experimental |