NVIDIA/cutlass
CUDA Templates and Python DSLs for High-Performance Linear Algebra
This project provides CUDA C++ template abstractions and Python DSLs for building highly optimized linear algebra operations, particularly general matrix-matrix multiplication (GEMM), on NVIDIA GPUs. Given a computation definition and data types, it produces high-performance CUDA kernels. Researchers, performance engineers, and students working on GPU programming for numerical applications will find it useful.
9,426 stars. Actively maintained with 9 commits in the last 30 days.
Use this if you need to develop custom, high-performance GPU kernels for linear algebra, especially matrix multiplication, via either the more accessible Python DSL or traditional C++ templates.
Not ideal if you are an end-user simply looking to run existing machine learning models or use standard data science libraries without writing custom GPU code.
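To give a flavor of the C++ template interface, here is a minimal sketch of a single-precision GEMM using the device-level `cutlass::gemm::device::Gemm` template from the CUTLASS 2.x API. The layouts, leading dimensions, and epilogue scalars shown are illustrative choices, and real code should check the returned `cutlass::Status`; consult the repository's own examples (e.g. `examples/00_basic_gemm`) for authoritative usage.

```cuda
#include <cutlass/gemm/device/gemm.h>

// Instantiate a GEMM operator for float inputs/outputs in column-major layout.
// Tile shapes, epilogue, and architecture are left at the template's defaults here.
using Gemm = cutlass::gemm::device::Gemm<
    float, cutlass::layout::ColumnMajor,   // ElementA, LayoutA
    float, cutlass::layout::ColumnMajor,   // ElementB, LayoutB
    float, cutlass::layout::ColumnMajor>;  // ElementC, LayoutC

// Computes C = alpha * A * B + beta * C on device pointers.
// A is M x K (leading dimension lda), B is K x N (ldb), C is M x N (ldc).
cutlass::Status run_sgemm(int M, int N, int K,
                          float alpha,
                          float const *A, int lda,
                          float const *B, int ldb,
                          float beta,
                          float *C, int ldc) {
  Gemm gemm_op;
  // Arguments: problem size, tensor refs for A, B, C (source), D (destination),
  // and the epilogue scalars {alpha, beta}.
  Gemm::Arguments args({M, N, K},
                       {A, lda}, {B, ldb},
                       {C, ldc}, {C, ldc},
                       {alpha, beta});
  // Invoking the operator launches the kernel on the default CUDA stream.
  return gemm_op(args);
}
```

The key design idea this illustrates is that the kernel is specialized at compile time from the template parameters, so the same pattern extends to mixed precision (e.g. FP16 inputs with FP32 accumulation) by swapping the element types and layouts.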
Stars
9,426
Forks
1,725
Language
C++
License
—
Category
Last pushed
Mar 12, 2026
Commits (30d)
9
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/NVIDIA/cutlass"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Related frameworks
iree-org/iree
A retargetable MLIR-based machine learning compiler and runtime toolkit.
rapidsai/cuml
cuML - RAPIDS Machine Learning Library
brucefan1983/GPUMD
Graphics Processing Units Molecular Dynamics
uxlfoundation/oneDAL
oneAPI Data Analytics Library (oneDAL)
NVIDIA/nccl
Optimized primitives for collective multi-GPU communication