HPC Cluster Management ML Frameworks

Resources, guides, and tools for setting up, configuring, and managing HPC clusters and distributed computing infrastructure for ML workloads. Does NOT include general cloud computing platforms, containerization tools, or ML frameworks themselves.

There are 32 HPC cluster management frameworks tracked. 4 score above 50 (Established tier). The highest-rated is qualcomm/ai-hub-models at 68/100 with 940 stars. 2 of the top 10 are actively maintained.

Get the projects as JSON (this example request sets limit=20)

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=ml-frameworks&subcategory=hpc-cluster-management&limit=20"

Open to everyone: 100 requests/day, no key needed. A free key raises the limit to 1,000 requests/day.
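The same request can be made from Python. The following is a minimal sketch using only the standard library; the helper names `build_url` and `fetch_projects` are hypothetical, and the shape of the returned JSON is not documented here, so the code only parses the response and does not assume any particular fields.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="ml-frameworks",
              subcategory="hpc-cluster-management",
              limit=20):
    """Build the request URL with the same query parameters as the curl call."""
    query = urlencode({
        "domain": domain,
        "subcategory": subcategory,
        "limit": limit,
    })
    return f"{BASE_URL}?{query}"


def fetch_projects(url, timeout=30):
    """Fetch the URL and parse the body as JSON.

    The response schema is an unknown here, so the parsed object is
    returned as-is for the caller to inspect.
    """
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example usage (performs a network request, counts against the daily limit):
# data = fetch_projects(build_url())
# print(json.dumps(data, indent=2)[:500])
```

Whether a larger `limit` value returns all 32 projects in one response is not stated on this page, so that would need to be checked against the API itself.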

# | Framework | Description | Score | Tier
1 | qualcomm/ai-hub-models | Qualcomm® AI Hub Models is our collection of state-of-the-art machine... | 68 | Established
2 | lincc-frameworks/hyrax | Hyrax - A low-code framework for rapid experimentation with ML &... | 58 | Established
3 | petuum/adaptdl | Resource-adaptive cluster scheduler for deep learning training. | 56 | Established
4 | zszazi/Deep-learning-in-cloud | List of Deep Learning Cloud Providers | 52 | Established
5 | openhackathons-org/gpubootcamp | This repository consists of GPU bootcamp material for HPC and AI | 44 | Emerging
6 | intel/ai-reference-models | Intel® AI Reference Models: contains Intel optimizations for running deep... | 44 | Emerging
7 | HydroRoll-Team/HydroRoll | A cross-platform, multi-task, highly customizable dice-system development framework. | 40 | Emerging
8 | HPCNow/hpcnow-labs | HPCNow! training material and hands-on sessions | 32 | Emerging
9 | pescap/EasyHPC | A practical introduction to High Performance Computing (HPC) | 31 | Emerging
10 | opencomputeproject/ocp-diag-windtunnel | Building & testing private AI on HPC. | 29 | Experimental
11 | ray-project/ray-acm-workshop-2023 | Scalable/Distributed Computer Vision with Ray | 29 | Experimental
12 | debnsuma/ray-for-developers | A comprehensive hands-on guide to building production-grade distributed... | 29 | Experimental
13 | binga/cloud-gpus | This repository contains information about Cloud GPU offerings for Machine... | 29 | Experimental
14 | hkust-hpc-team/hkust-hpc | Handbook for AI / HPC users on HKUST central clusters | 28 | Experimental
15 | Roulbac/uv-func | A Python decorator to run functions in isolated virtual environments... | 27 | Experimental
16 | knagrecha/hydra | Execution framework for multi-task model parallelism. Enables the training... | 26 | Experimental
17 | onlyrobot/bray | Bray is based on Ray and outperforms Ray in practical distributed... | 23 | Experimental
18 | gpu-cli/zerostart | Fast cold starts for GPU Python. Streaming wheel extraction for when large... | 23 | Experimental
19 | Skyld-Labs/ModelHunter | ModelHunter is a powerful pipeline designed to extract machine learning... | 22 | Experimental
20 | uw-mad-dash/shockwave | Artifact for "Shockwave: Fair and Efficient Cluster Scheduling for Dynamic... | 20 | Experimental
21 | hydra-hoard/hydra | A decentralised application that creates high quality machine learning datasets | 20 | Experimental
22 | jonathandinu/spark-ray-data-science | Supporting content (slides and exercises) for the Pearson video series... | 19 | Experimental
23 | parisimaa/NYU-HPC | NYU HPC user instruction | 18 | Experimental
24 | breadboardfoundry/GPU-Infrastructure | GPU compute infrastructure for research teams running machine learning experiments. | 15 | Experimental
25 | Adhytm/multi-gpu-debug-notes | Debugging and isolating GPU context preemption issues in heterogeneous... | 14 | Experimental
26 | RichardScottOZ/experimenta-ml-kiro | experimenta-ml for kiro-cli | 14 | Experimental
27 | Syntex-errorCode/stable-flakes | 🔗 Stabilize your Flakes easily with one input for reliable NixOS modules and... | 14 | Experimental
28 | erectbranch/enroot-on-slurm | Examples of using Enroot with Slurm for distributed deep learning | 13 | Experimental
29 | SupreethRao99/slurmy | Template scripts and notes for using SLURM on NVIDIA DGX GPU clusters | 11 | Experimental
30 | Akshay3510/Hydra | 🔍 Develop advanced knowledge compilers and #SAT solvers with Hydra, a robust... | 11 | Experimental
31 | alifzl/NeSI-Project-Template | NeSI HPC DL project Scaffolding Template | 10 | Experimental
32 | smirko-dev/machine-learning-rpi | Setup ML for Raspberry Pi | 10 | Experimental
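The tier labels in the listing are consistent with simple score cutoffs. The sketch below illustrates that pattern; only the "above 50" rule for the Established tier is stated on this page. The cutoff of 30 between Emerging and Experimental is an assumption inferred from the listed scores (31 is Emerging, 29 is Experimental), and the function name `tier_for` is hypothetical. The site's actual scoring rules may differ.

```python
from collections import Counter

# Scores copied from the listing above, in rank order.
SCORES = [68, 58, 56, 52, 44, 44, 40, 32, 31, 29, 29, 29, 29, 28, 27, 26,
          23, 23, 22, 20, 20, 19, 18, 15, 14, 14, 14, 13, 11, 11, 10, 10]


def tier_for(score):
    """Map a 0-100 score to a tier label.

    "Above 50" for Established comes from the page text. The boundary
    at 30 is an assumption based on the observed scores.
    """
    if score > 50:
        return "Established"
    if score >= 30:  # assumed cutoff, not confirmed by the source
        return "Emerging"
    return "Experimental"


# Count how many listed projects fall into each tier.
tier_counts = Counter(tier_for(s) for s in SCORES)
print(dict(tier_counts))
```

Under these assumed cutoffs the counts come out to 4 Established, 5 Emerging, and 23 Experimental, which matches the labels shown for all 32 entries.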