stas00/ml-engineering

Machine Learning Engineering Open Book

60 / 100 (Established)

Comprehensive guide covering distributed training infrastructure—hardware selection (accelerators, storage, networking), orchestration via SLURM, and debugging techniques—distilled from production experience training BLOOM-176B and IDEFICS-80B. Includes practical benchmarking tools (all_reduce_bench.py, torch-distributed-gpu-test.py), copy-paste commands for common issues, and comparative hardware performance tables to guide cloud architecture decisions.

17,380 stars. Actively maintained with 3 commits in the last 30 days.

No package, no dependents
Maintenance 16 / 25
Adoption 10 / 25
Maturity 16 / 25
Community 18 / 25
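
The four category scores add up to the overall score shown at the top (16 + 10 + 16 + 18 = 60). A quick check, assuming the total is a plain sum of the four categories, which the page does not state explicitly:

```python
# Category scores as listed above, each out of 25.
scores = {"Maintenance": 16, "Adoption": 10, "Maturity": 16, "Community": 18}

total = sum(scores.values())    # sum of the four category scores
maximum = 25 * len(scores)      # four categories, 25 points each
print(f"{total} / {maximum}")
```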


Stars: 17,380
Forks: 1,103
Language: Python
License: CC-BY-SA-4.0
Last pushed: Mar 11, 2026
Commits (30d): 3

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/stas00/ml-engineering"

The API is open to everyone: 100 requests/day without a key, or 1,000/day with a free key.
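
The same request can be made from Python. This is a minimal sketch using only the standard library, and it assumes the endpoint returns JSON. The response schema is not documented here, so the function simply returns the decoded body. `build_quality_url` and `fetch_quality` are hypothetical helper names, and the `transformers` path segment is copied from the curl command above without knowing which other values the API accepts.

```python
import json
import urllib.parse
import urllib.request

# Base URL copied from the curl example; everything else here is illustrative.
API_BASE = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL in the shape used by the curl example."""
    parts = [urllib.parse.quote(p, safe="") for p in (category, owner, repo)]
    return f"{API_BASE}/{'/'.join(parts)}"


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    """Fetch the record and decode it, assuming the response body is JSON."""
    url = build_quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `fetch_quality("transformers", "stas00", "ml-engineering")` performs a real network request and counts against the daily request limit.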