stas00/ml-engineering
Machine Learning Engineering Open Book
Comprehensive guide covering distributed training infrastructure—hardware selection (accelerators, storage, networking), orchestration via SLURM, and debugging techniques—distilled from production experience training BLOOM-176B and IDEFICS-80B. Includes practical benchmarking tools (all_reduce_bench.py, torch-distributed-gpu-test.py), copy-paste commands for common issues, and comparative hardware performance tables to guide cloud architecture decisions.
17,380 stars. Actively maintained with 3 commits in the last 30 days.
Stars: 17,380
Forks: 1,103
Language: Python
License: CC-BY-SA-4.0
Category:
Last pushed: Mar 11, 2026
Commits (30d): 3
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/stas00/ml-engineering"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
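The same request can be sketched in Python using only the standard library. This is a minimal illustration, not the service's official client: the helper names `build_quality_url` and `fetch_quality` are hypothetical, the path layout (`category/owner/repo`) is inferred from the curl command above, and the assumption that the endpoint returns JSON is not confirmed by this page.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(owner, repo, category="transformers"):
    """Build the endpoint URL (hypothetical helper).

    The category/owner/repo layout is inferred from the single
    curl example on this page, not from API documentation.
    """
    parts = [quote(p, safe="") for p in (category, owner, repo)]
    return BASE_URL + "/" + "/".join(parts)


def fetch_quality(owner, repo, category="transformers", timeout=10.0):
    """Fetch the data for one repository (hypothetical helper).

    Assumption: the response body is JSON. If it is not,
    json.loads raises a ValueError (json.JSONDecodeError).
    """
    url = build_quality_url(owner, repo, category)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        body = resp.read().decode("utf-8")
    return json.loads(body)


if __name__ == "__main__":
    # Only prints the URL; call fetch_quality(...) to hit the network.
    print(build_quality_url("stas00", "ml-engineering"))
```

Without a key, the page states a limit of 100 requests/day, so caching the result locally is advisable when polling many repositories.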
Related models
SwanHubX/SwanLab
⚡️SwanLab - an open-source, modern-design AI training tracking and visualization tool. Supports...
labmlai/annotated_deep_learning_paper_implementations
🧑‍🏫 60+ Implementations/tutorials of deep learning papers with side-by-side notes 📝; including...
mdsrqbl/omnihuman
AI model that understands text & humanoids.
analyticalrohit/AI-ML-Cheatsheets
All Stanford Cheatsheets: Artificial Intelligence, Transformers, LLMs, Deep Learning, Machine...
avikumart/LLM-GenAI-Transformers-Notebooks
A repository containing all the LLM notebooks with tutorials and projects