NVIDIA/cutlass
CUDA Templates and Python DSLs for High-Performance Linear Algebra
This project provides CUDA C++ template abstractions and Python DSLs for building highly optimized linear algebra operations, particularly general matrix-matrix multiplication (GEMM), on NVIDIA GPUs. Given a computation definition and data types, it produces high-performance CUDA kernels. Researchers, performance engineers, and students working on GPU programming for numerical applications will find it useful.
9,426 stars. Actively maintained with 9 commits in the last 30 days.
Use this if you need to develop custom, high-performance GPU kernels for linear algebra, especially matrix multiplication, via either the more accessible Python DSL or traditional C++ templates.
Not ideal if you are an end-user simply looking to run existing machine learning models or use standard data science libraries without writing custom GPU code.
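To give a flavor of the C++ template interface, here is a minimal sketch of a single-precision GEMM using the device-level `cutlass::gemm::device::Gemm` template from the CUTLASS 2.x API. The layouts, leading dimensions, and epilogue scalars shown are illustrative choices, and real code should check the returned `cutlass::Status`; consult the repository's own examples (e.g. `examples/00_basic_gemm`) for authoritative usage.

```cuda
#include <cutlass/gemm/device/gemm.h>

// Instantiate a GEMM operator for float inputs/outputs in column-major layout.
// Tile shapes, epilogue, and architecture are left at the template's defaults here.
using Gemm = cutlass::gemm::device::Gemm<
    float, cutlass::layout::ColumnMajor,   // ElementA, LayoutA
    float, cutlass::layout::ColumnMajor,   // ElementB, LayoutB
    float, cutlass::layout::ColumnMajor>;  // ElementC, LayoutC

// Computes C = alpha * A * B + beta * C on device pointers.
// A is M x K (leading dimension lda), B is K x N (ldb), C is M x N (ldc).
cutlass::Status run_sgemm(int M, int N, int K,
                          float alpha,
                          float const *A, int lda,
                          float const *B, int ldb,
                          float beta,
                          float *C, int ldc) {
  Gemm gemm_op;
  // Arguments: problem size, tensor refs for A, B, C (source), D (destination),
  // and the epilogue scalars {alpha, beta}.
  Gemm::Arguments args({M, N, K},
                       {A, lda}, {B, ldb},
                       {C, ldc}, {C, ldc},
                       {alpha, beta});
  // Invoking the operator launches the kernel on the default CUDA stream.
  return gemm_op(args);
}
```

The key design idea this illustrates is that the kernel is specialized at compile time from the template parameters, so the same pattern extends to mixed precision (e.g. FP16 inputs with FP32 accumulation) by swapping the element types and layouts.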
Stars
9,426
Forks
1,725
Language
C++
License
—
Category
Last pushed
Mar 12, 2026
Commits (30d)
9
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/NVIDIA/cutlass"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Related frameworks
iree-org/iree
A retargetable MLIR-based machine learning compiler and runtime toolkit.
rapidsai/cuml
cuML - RAPIDS Machine Learning Library
brucefan1983/GPUMD
Graphics Processing Units Molecular Dynamics
uxlfoundation/oneDAL
oneAPI Data Analytics Library (oneDAL)
NVIDIA/nccl
Optimized primitives for collective multi-GPU communication