Spark Hadoop ML Pipelines ML Frameworks

Distributed machine learning frameworks and tools built on Apache Spark, Hadoop, or similar big data processing systems for large-scale data processing. Does NOT include standalone ML libraries, REST API wrappers without distributed computation, or Spring Boot microservices without core data processing components.

83 Spark/Hadoop ML pipeline frameworks are tracked. Three score above 50 (Established tier). The highest-rated is Angel-ML/angel at 57/100 with 6,785 stars.

Get the projects as JSON (the request below sets limit=20)

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=ml-frameworks&subcategory=spark-hadoop-ml-pipelines&limit=20"

Open to everyone: 100 requests/day, no key needed. A free key raises the limit to 1,000/day.
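The same request can be made from a script. The following is a minimal sketch in Python using only the standard library. The endpoint and the query parameters (domain, subcategory, limit) are taken from the curl example above. The helper names build_url and fetch are hypothetical, and the shape of the JSON response is not documented here, so the sketch returns the parsed payload without assuming any field names. It also omits API-key handling, since this page does not say how a key is passed.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the curl example on this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="ml-frameworks",
              subcategory="spark-hadoop-ml-pipelines",
              limit=20):
    """Build the request URL (hypothetical helper, not part of the API)."""
    query = urllib.parse.urlencode(
        {"domain": domain, "subcategory": subcategory, "limit": limit}
    )
    return f"{BASE_URL}?{query}"


def fetch(url, timeout=30):
    """GET the URL and parse the body as JSON.

    The response schema is an unknown here, so the parsed object is
    returned as-is rather than unpacked into specific fields.
    """
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example usage (performs a network call and counts against the daily quota):
# data = fetch(build_url())
# print(json.dumps(data, indent=2))
```

Whether a limit larger than 20 is accepted, or whether the API paginates, is not stated on this page. Check the response before relying on a single call to return all 83 projects.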

# Framework Score Tier
1 Angel-ML/angel

A Flexible and Powerful Parameter Server for large-scale machine learning

57
Established
2 lensacom/sparkit-learn

PySpark + Scikit-learn = Sparkit-learn

53
Established
3 alibaba/Alink

Alink is the Machine Learning algorithm platform based on Flink, developed...

51
Established
4 databricks/spark-sklearn

(Deprecated) Scikit-learn integration package for Apache Spark

44
Emerging
5 OryxProject/oryx

Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time...

44
Emerging
6 mahmoudparsian/data-algorithms-book

MapReduce, Spark, Java, and Scala for Data Algorithms Book

44
Emerging
7 kaiwaehner/kafka-streams-machine-learning-examples

This project contains examples which demonstrate how to deploy analytic...

44
Emerging
8 jadianes/spark-py-notebooks

Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine...

44
Emerging
9 tirthajyoti/Spark-with-Python

Fundamentals of Spark with Python (using PySpark), code examples

44
Emerging
10 endymecy/spark-ml-source-analysis

Analysis of Spark ML algorithm principles and their concrete source-code implementations

44
Emerging
11 MingChen0919/learning-apache-spark

Notes on Apache Spark (pyspark)

44
Emerging
12 flink-extended/dl-on-flink

Deep Learning on Flink aims to integrate Flink and deep learning frameworks...

44
Emerging
13 ShifuML/shifu

An end-to-end machine learning and data mining framework on Hadoop

43
Emerging
14 kaiwaehner/ksql-udf-deep-learning-mqtt-iot

Deep Learning UDF for KSQL for Streaming Anomaly Detection of MQTT IoT Sensor Data

43
Emerging
15 apache/flink-ml

Machine learning library of Apache Flink

43
Emerging
16 romain-e-lacoste/sparklen

A statistical learning toolkit for high-dimensional Hawkes processes in Python

43
Emerging
17 TodoEconometria/ejercicios-bigdata

Complete Big Data course with Python (230h) — SQLite to Kafka to TensorFlow....

42
Emerging
18 kanyun-inc/ytk-learn

Ytk-learn is a distributed machine learning library which implements most of...

42
Emerging
19 kaiwaehner/tensorflow-serving-java-grpc-kafka-streams

Kafka Streams + Java + gRPC + TensorFlow Serving => Stream Processing...

41
Emerging
20 sparkling-graph/sparkling-graph

SparklingGraph provides an easy-to-use set of features that will give you...

40
Emerging
21 ShifuML/guagua

An iterative computing framework for both Hadoop MapReduce and Hadoop YARN.

39
Emerging
22 kaiwaehner/ksql-fork-with-deep-learning-function

Deep Learning UDF for KSQL, the Streaming SQL Engine for Apache Kafka with...

39
Emerging
23 siddhi-io/siddhi-execution-streamingml

Extension that performs streaming machine learning on event streams

38
Emerging
24 XYWENJIE/spring-ai-extension

An extension of Spring AI that supports Alibaba Cloud’s dashscope...

38
Emerging
25 SAP-samples/hana-apl-apis-runtimes

Code examples for SAP HANA Automated Predictive Library (APL). It provides...

37
Emerging
26 sbl-sdsc/mmtf-spark

Methods for the parallel and distributed analysis and mining of the Protein...

35
Emerging
27 viadee/bpmn.ai

Machine learning around business processes

33
Emerging
28 shalini0528/big-data-weather-analysis

Big Data weather analysis using Hadoop MapReduce, Apache Hive, Apache Spark,...

32
Emerging
29 feedzai/feedzai-openml

API for Feedzai's Open Machine Learning that allows to integrate ML...

32
Emerging
30 microsoft/masc

Microsoft's contributions for Spark with Apache Accumulo

32
Emerging
31 arminmoin/ML-Quadrat

ML-Quadrat (ML2) is a Model-Driven Software Engineering (MDSE) tool with...

32
Emerging
32 siddhi-io/siddhi-execution-tensorflow

Extension that adds support for inferences from pre-built TensorFlow SavedModels

31
Emerging
33 adventure-island/springboot-deepar-template

A Java(SpringBoot) template for Java and AWS SageMaker DeepAR model endpoint...

28
Experimental
34 jiumao-org/we-mall

A lightweight mall, simple and easy.

28
Experimental
35 comet-ml/comet-java-sdk

Comet Java SDK

27
Experimental
36 predictiveworks/cdap-spark

A wrapper for Apache Spark to make machine & deep learning available in...

27
Experimental
37 AlanBinu007/AI_Big-Data_Data-Engineering_and_Distributions

Here we created some projects using Kafka, AI, data virtualization and...

27
Experimental
38 iaja/scalaLDAvis

Scala-Spark port of https://github.com/bmabey/pyLDAvis for Apache Spark LDA...

26
Experimental
39 mikeroyal/Apache-Spark-Guide

Apache Spark Guide

26
Experimental
40 alipay/jpmml-sparkml-lightgbm

JPMML-SparkML plugin for converting LightGBM-Spark models to PMML

26
Experimental
41 rhinempi/sparkhit

sparkhit - analyzing large scale genomic data on the cloud

26
Experimental
42 IPVS-AS/MMP-Backend

A Model Management Platform (MMP) for Industry 4.0 Environments (Backend)

26
Experimental
43 almo/Machine-Learning

Machine Learning snippets and use cases.

25
Experimental
44 manuparra/taller_SparkR

SparkR workshop for the Jornadas de Usuarios de R (R Users Conference)

25
Experimental
45 chen0040/java-machine-learning-web-api

A simple machine learning web server that caters for small datasets

25
Experimental
46 AxaFrance/spring-ai-workshop

Exploring interactions with LLMs : Practical insights with Spring AI

25
Experimental
47 perguard/pg-streaming-performance-data

Data collection, feature engineering and machine learning of performance traces

24
Experimental
48 nicolaskrier/spring-ai-examples

Spring AI Examples

24
Experimental
49 AvaAvarai/Java-Parallel-Coordinates-Vis

Java Parallel Coordinates Visualization Tool, to visualize...

23
Experimental
50 senx/warp10-ext-pmml

WarpScript™ PMML Extension

22
Experimental
51 AmrrSalem/Pyspark-Local

Portable self-contained PySpark 3.5 environment for Big Data coursework,...

22
Experimental
52 galafis/spark-kafka-ml-training-pipeline

Distributed ML training pipeline with Spark processing, Kafka ingestion and...

22
Experimental
53 dhchenx/Catla-HS

Catla for Hadoop and Spark (Catla-HS): An open-source system to support...

22
Experimental
54 zzzzz1st/predictorML

Machine learning and prediction service for Niagara NX platform.

22
Experimental
55 maengsanha/bigdata

KMU CS Hot Topics in Big Data

21
Experimental
56 DeathReaper0965/distributed-deeplearning

End to End Distributed Deep Learning Engine, works both with Streaming and...

21
Experimental
57 pneff93/Kafka-R-Realtime-Prediction

This tutorial explains how a machine learning model is applied on real-time data

20
Experimental
58 siddhi-io/siddhi-gpl-execution-pmml

Siddhi extension to evaluate Predictive Model Markup Language (PMML).

17
Experimental
59 nickozoulis/thunderstorm

Investigating the trade-offs of low latency responses over quality when...

17
Experimental
60 kriss024/Spark

Spark for Data Science and ETL process.

16
Experimental
61 neerajkesav/SparkMLJavaExamples

Apache Spark Machine Learning - Java Examples

16
Experimental
62 Mazennaji/ai-intelligence-platform-java-ml

An all-in-one Java Machine Learning platform integrating fraud detection,...

15
Experimental
63 iamirmasoud/pyspark_tutorials

Machine Learning for Big Data using PySpark with real-world projects

15
Experimental
64 Sowdeshwar-99/noise-aware-ml-pipeline

Noise-aware ML pipeline for large-scale agricultural yield prediction using...

15
Experimental
65 TravelXML/APACHE-SPARK-PYSPARK-DATABRICKS-MACHINE-LEARNING-MLIB

Apache Spark Machine Learning project using MLlib and Linear Regression on...

15
Experimental
66 hevc15hamza/pyspark-airfoil-noise-prediction

Predict airfoil self-noise using PySpark with an end-to-end machine learning...

14
Experimental
67 Sishant123/scala-m9k

🚀 Streamline big data processing with Scala and M9K, enhancing performance...

14
Experimental
68 Swapnil-2596/scala-aba

🚀 Transform Scala code into efficient, scalable applications with scala-aba,...

14
Experimental
69 aengusmartindonaire/pyspark-ml-pipeline

PySpark ML classification pipelines for NLP, clinical prediction, and census...

14
Experimental
70 mn-cs/fineweb-spark

FineWeb-Edu dataset analysis using Apache Spark - DSC 232R group project

14
Experimental
71 agoda-com/spark-hpopt

Bayesian hyperparameter tuning for Spark MLlib

13
Experimental
72 MinLee0210/kafka-learning

Learning how to use Kafka

13
Experimental
73 rtybase/pmml-microservice

A toy Phishing Classification Service using PMML for demo purposes

12
Experimental
74 adil-faiyaz98/accelerated-spark-gpu

This repository demonstrates how to significantly accelerate Apache Spark 3...

12
Experimental
75 shakha-de/mnist-java-microservice

Spring Boot Microservice for MNIST

11
Experimental
76 alikemalocalan/Spark-API

Apache Spark Recommendation/Machine Learning Api Service

11
Experimental
77 MehdiBukhari/oak

A Scalable Concurrent Key-Value Map for Big Data Analytics

11
Experimental
78 Chih-Ling-Hsu/Spark-Machine-Learning-Modules

Machine Learning Modules of Spark MLlib

11
Experimental
79 sivasurya681/PySpark

PySpark-Roadmap is an 18-day structured learning journey that takes you from...

11
Experimental
80 hinzy97/spark-dynamic-executor-time-prediction

Neural Network Models for Predicting Execution Time with Dynamic Executor...

11
Experimental
81 FadilAdz/praktikumBigData

This repository contains a series of Big Data lab exercises covering storage...

11
Experimental
82 daugraph/ParameterServer

Parameter Server using Java

10
Experimental
83 GPalfy/socialnetworkcomments

Text Data Analysis & Machine Learning on supermarket's Social...

10
Experimental