sql-machine-learning/elasticdl
Kubernetes-native Deep Learning Framework
Leverages Kubernetes API calls to dynamically manage worker and parameter server pods, enabling fault-tolerant training where job failures don't interrupt execution—workers can be preempted or killed without requiring checkpoint recovery. Supports TensorFlow (Estimator and Keras) and PyTorch with a minimal CLI interface that automatically distributes training across clusters while integrating with Kubernetes priority-based preemption for elastic resource scheduling.
746 stars. No commits in the last 6 months.
Stars
746
Forks
116
Language
Python
License
MIT
Category
Last pushed
Jan 26, 2024
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/sql-machine-learning/elasticdl"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
muna-ai/muna-py
Run AI models anywhere. https://muna.ai/explore
clearml/clearml-pycharm-plugin
ClearML PyCharm Plugin
microsoft/AKSDeploymentTutorial
Tutorial on how to deploy Deep Learning models on GPU enabled Kubernetes cluster
Langhalsdino/Kubernetes-GPU-Guide
This guide should help fellow researchers and hobbyists to easily automate and accelerate there...
KAMONOHASHI/kamonohashi
AI開発プラットフォームKAMONOHASHI