Data Quality Preprocessing ML Frameworks

Tools and techniques for assessing, cleaning, and preparing datasets for machine learning. Includes data validation, outlier detection, missing value handling, and dataset quality frameworks. Does NOT include domain-specific cleaning (e.g., text-only or image-only), general data science tutorials without code frameworks, or downstream ML modeling tasks.

This category tracks 102 data quality preprocessing frameworks. Four score above 70 (Verified tier). The highest-rated is biolab/orange3 at 90/100, with 5,573 stars and 33,517 monthly downloads. Four of the top 10 are actively maintained.
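The tier labels in the listing appear to follow score bands. The sketch below is only an illustration of that idea: the page states that scores above 70 are Verified, while the cutoffs at 50 and 30 are assumptions inferred from the scores shown in the list (Established 52 to 68, Emerging 30 to 49, Experimental 28 and below), not documented thresholds. The function name `tier_for` is hypothetical.

```python
def tier_for(score: int) -> str:
    """Map a 0-100 score to a tier label.

    The 70 cutoff comes from the page text; 50 and 30 are assumed
    cutoffs that merely agree with the scores visible in the listing.
    """
    if score >= 70:
        return "Verified"
    if score >= 50:  # assumed boundary
        return "Established"
    if score >= 30:  # assumed boundary
        return "Emerging"
    return "Experimental"


# Spot-check against a few rows from the listing.
for name, score in [
    ("biolab/orange3", 90),
    ("fbdesignpro/sweetviz", 68),
    ("SERG-Delft/dslinter", 49),
    ("councilofelders/numereval", 28),
]:
    print(f"{name}: {score} -> {tier_for(score)}")
```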

Get all 102 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=ml-frameworks&subcategory=data-quality-preprocessing&limit=20"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
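For scripted access, the same request can be made from Python. This is a minimal sketch using only the standard library; it reuses the endpoint and query parameters from the curl command above and assumes nothing about the shape of the JSON response, which is not documented on this page. The helper names `build_url` and `fetch_json` are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl command on this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain: str, subcategory: str, limit: int) -> str:
    """Build the request URL with the same query parameters as the curl call."""
    query = urlencode(
        {"domain": domain, "subcategory": subcategory, "limit": limit}
    )
    return f"{BASE_URL}?{query}"


def fetch_json(url: str, timeout: float = 30.0):
    """Fetch a URL and decode the body as JSON.

    The response schema is not described on this page, so the decoded
    value is returned as-is without assuming any particular keys.
    """
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


url = build_url("ml-frameworks", "data-quality-preprocessing", 20)
print(url)
```

Calling `fetch_json(url)` performs the actual request and counts against the daily request limit, so it is left out of the block above.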

# Framework Score Tier
1 biolab/orange3

🍊 Orange: Interactive data analysis

90
Verified
2 skrub-data/skrub

Machine learning with dataframes

84
Verified
3 cleanlab/cleanlab

Cleanlab's open-source library is the standard data-centric AI package for...

76
Verified
4 root-project/root

The official repository for ROOT: analyzing, storing and visualizing big...

76
Verified
5 fbdesignpro/sweetviz

Visualize and compare datasets, target values and associations, with one...

68
Established
6 drivendataorg/deon

A command line tool to easily add an ethics checklist to your data science projects.

64
Established
7 deepnote/deepnote

Deepnote is a drop-in replacement for Jupyter with an AI-first design, sleek...

63
Established
8 JasonKessler/scattertext

Beautiful visualizations of how language differs among document types.

60
Established
9 deepnote/deepnote-toolkit

Essential Python toolkit for Deepnote environments

58
Established
10 rhiever/datacleaner

A Python tool that automatically cleans data sets and readies them for analysis.

58
Established
11 bodo-ai/PyDough

Analytics DSL for Python

56
Established
12 Renumics/spotlight

Interactively explore unstructured datasets from your dataframe.

55
Established
13 ShimantoRahman/empulse

Value-driven and cost-sensitive analysis for scikit-learn

54
Established
14 AutoViML/pandas_dq

Find data quality issues and clean your data in a single line of code with a...

52
Established
15 SERG-Delft/dslinter

`dslinter` is a pylint plugin for linting data science and machine learning...

49
Emerging
16 IRT-SystemX/dqm-ml

A library to compute data quality metrics

49
Emerging
17 Data-Centric-AI-Community/ydata-quality

Data Quality assessment with one line of code

47
Emerging
18 ml-tooling/ml-workspace

🛠 All-in-one web-based IDE specialized for machine learning and data science.

47
Emerging
19 PAIR-code/facets

Visualizations for machine learning datasets

47
Emerging
20 msamogh/nonechucks

Deal with bad samples in your dataset dynamically, use Transforms as...

47
Emerging
21 MPEDS/mpeds

Machine-learning Protest Event Data System

46
Emerging
22 COM6012/ScalableML

COM6012 Scalable Machine Learning - University of Sheffield. Enjoy our...

46
Emerging
23 scienxlab/redflag

Safety net for machine learning pipelines. Plays nice with sklearn and pandas.

45
Emerging
24 altermarkive/shrubbery

Numerai Experiments

44
Emerging
25 buabaj/xplore

A Python package built for data scientists/analysts and AI/ML engineers for...

44
Emerging
26 Digital-Dermatology/SelfClean

[NeurIPS 2024] 🧼🔎 A holistic self-supervised data cleaning strategy to...

43
Emerging
27 gretl-project/gretl

Official mirror of the actively maintained repo on sourceforge

43
Emerging
28 Olow304/Data-Science-Machine-Learning

The overall objective of this toolkit is to provide and offer a free...

40
Emerging
29 JacksonBurns/astartes

Better Data Splits for Machine Learning

40
Emerging
30 Renumics/sliceguard

A library for detecting problematic data segments in structured and...

39
Emerging
31 matthewfeickert-talks/reproducible-ml-for-scientists-with-pixi-scipy-2025

SciPy 2025 tutorial on "Reproducible Machine Learning Workflows for...

39
Emerging
32 cssr-tools/ML_near_well

Runfiles for an ML near-well model and to reproduce results from the article...

38
Emerging
33 fusion-jena/MLProvLab

Provenance Management for Data Science Notebooks

37
Emerging
34 Safe-DS/Stub-Generator

Automated generation of Safe-DS stubs for Python libraries.

37
Emerging
35 Livingston-k/cleanPyData

cleanPyData is a Python package for data cleaning and preprocessing. It...

36
Emerging
36 PKNU-PR-ML-Lab/orange

Machine learning and data analysis made easy with Orange (Orange3)

36
Emerging
37 pierpaolo28/Data-Visualization

Collection of interactive Jupyter Notebook widgets and graphs.

35
Emerging
38 genular/pandora

PANDORA

35
Emerging
39 France-Travail/gabarit

Gabarit: kickstart your data science project from scratch

35
Emerging
40 HelikarLab/candis

A data mining suite for gene expression data.

35
Emerging
41 microsoft/Data-Discovery-Toolkit

A data discovery and manipulation toolset for unstructured data

35
Emerging
42 synapticore-io/marimo-flow

Interactive ML notebooks with reactive updates, AI assistance, and MLflow tracking

34
Emerging
43 cdr-book/cdr-book.github.io

Repository for the website of the book (github hosting support)

33
Emerging
44 HazyResearch/meerkat

Explore and understand your training and validation data.

33
Emerging
45 ThomasWong2022/numerai-benchmark

Python Code used in publications, for archival purposes only

32
Emerging
46 khuyentran1401/reproducible-data-science

Tutorials on creating a reproducible and maintainable data science project

32
Emerging
47 seedatnabeel/Data-IQ

Data-IQ: Characterizing subgroups with heterogeneous outcomes in tabular...

31
Emerging
48 pmaji/data-science-toolkit

Collection of stats, modeling, and data science tools in Python and R.

31
Emerging
49 BinaryResearch/centrifuge-toolkit

Tool for visualizing and empirically analyzing information encoded in binary files

31
Emerging
50 dirty-data-science/python

Tutorial material on machine learning with dirty data in Python

31
Emerging
51 CharlesAverill/satyrn

A Notebook alternative that supports branching code and local collaboration.

30
Emerging
52 councilofelders/numereval

A small library to locally calculate the scores on numer.ai tournament's...

28
Experimental
53 seedatnabeel/Data-SUITE

Data-SUITE: Data-centric identification of in-distribution incongruous...

28
Experimental
54 awojinrin/ML-Workflow-for-the-Determination-of-Hole-Cleaning-Conditions

A repo containing Jupyter notebooks where ensemble algorithms are...

27
Experimental
55 fuseml/examples

A collection of machine learning projects serving as sample applications...

27
Experimental
56 gianlucatruda/numerai

Quant. trading with ML on Numerai

27
Experimental
57 ahmedshahriar/PulsePoint-Data-Analytics

EDA, data processing, cleaning and extensive geospatial analysis on a...

26
Experimental
58 sumanthprabhu/DQC-Toolkit

Quality Checks for Training Data in Machine Learning

26
Experimental
59 adipolak/scaling-machine-learning-course

Scaling Machine Learning in Three Weeks course, in collaboration with...

26
Experimental
60 numerai/signals-example-scripts

The official example scripts for the Numerai Signals Data Science Tournament

26
Experimental
61 sultanul-ovi/GPU-Cluster-Spot-Resource-Dataset-Analysis

Detailed Analysis Traces for AI jobs leveraging spot GPU resources

26
Experimental
62 NatanMish/data_validation

Tutorial for implementing data validation in data science pipelines

25
Experimental
63 giagiannis/data-profiler

Data profiler is an attempt to model the behavior of a given operator for a...

24
Experimental
64 s-kav/ds_tools

Library consisting of additional & helpful functions for data science research stages

24
Experimental
65 Diogolsn10/statistical-analysis

Provide well-documented statistical analysis tools in Python, R, and Stata...

23
Experimental
66 galafis/awesome-data-science-toolkit

🚀 Comprehensive toolkit for data scientists with Python utilities, ML...

23
Experimental
67 sbettid/GPSClean

An application to correct a GPS trace using machine learning techniques. To...

23
Experimental
68 Vis4Sense/ml-prov-binder

The code for running our Jupyter Lab extension on https://mybinder.org/

23
Experimental
69 pawlyk/dsml-tools

Set of data science and machine learning tools

23
Experimental
70 vyshakA/Orange-VoIP-FreePBX-Trunk

📞 Add Orange home phone service to FreePBX as a VoIP trunk with simple steps...

22
Experimental
71 yuliu625/Yu-Data-Science-Toolkit

A modular data science toolkit for scientific research, featuring...

22
Experimental
72 virbahu/dmaic-toolkit

Lean Six Sigma DMAIC toolkit statistical tests

22
Experimental
73 ELHoussineT/AutoDataCleaner

Simple and automatic data cleaning in one line of code! It performs one-hot...

22
Experimental
74 iterative/example-gto

Get Started GTO Project

22
Experimental
75 KaziAmitHasan/data-inspector

Data Inspector is an open-source python library that brings 15++ types of...

22
Experimental
76 NimoKwarkye/stats_tool_repo

XploreML is a node-based application built with dearpygui. This application...

22
Experimental
77 sarwarbeing-ai/Scaler

Scaler: Study Materials for Data Science and Machine Learning

22
Experimental
78 LEL-A/GerAlpacaDataCleaned

German Alpaca Dataset (Cleaned + Translated)

20
Experimental
79 sturlese/numerai_signals_pipeline

Downloads data from Yahoo Finance, generates features, trains a model and...

20
Experimental
80 chiphuyen/metaflow-transformers-tutorials

Metaflow tutorials for ODSC West 2021

20
Experimental
81 berkaygediz/SolidSheets

📊 A modern spreadsheet editor with ML integration, supporting real-time...

17
Experimental
82 akashmi/ai-data-engineering-ecosystem-guide

A comprehensive reference guide mapping the entire AI, Machine Learning,...

17
Experimental
83 kjd-dktech/ml-data-analysis-pipeline

Exploratory Data Analysis and Modeling (Academic Setting)

15
Experimental
84 darsh276/snowflake-mh9

❄️ Simplify data management with snowflake-mh9, a tool that streamlines...

15
Experimental
85 NERC-CEH/DSFP-PyExplorer

A Python package for doing exploratory data analysis of collections on the...

15
Experimental
86 AliAmini93/Data-Distribution-Finder

Developed a Windows-based app for analyzing data distributions and...

15
Experimental
87 RezaMoammadi/Book-Data-Science-R

If you're eager to explore data science, data analysis, and machine...

15
Experimental
88 SakuraPuare/AlibabaTrace

Design and implementation of an analysis and visualization system for the Alibaba cluster dataset cluster-trace-v2018

14
Experimental
89 nguyencongtri/data12

🚀 Build scalable enterprise applications with a robust architecture that...

14
Experimental
90 jyhuang201900/Orange-Engine

Integrate the Orange Engine with ease using our free library and sample...

14
Experimental
91 garimamittal13/SMAI-M25

Data analysis, statistical modeling, clustering, forecasting, and deep...

14
Experimental
92 LTxYan/Data-Reliability-Noisy-Input-Handling-in-ML-Models

🔍 Analyze how noisy and incomplete data impacts machine learning model...

14
Experimental
93 PoojaSiv0211/DataGenome

Interactive dataset structure visualizer using correlation analysis,...

14
Experimental
94 sultanul-ovi/Alibaba-GPU-Cluster-Dataset-2025-Analysis

Detailed Analysis Traces for GPU-Disaggregated Deep Learning Recommendation Models

14
Experimental
95 mentoratechnologies/PurifyFactory-Beta

PurifyFactory v9.1.6: Beta Program for Beta Testers

14
Experimental
96 fhswf/paper-mlwa-mlpro-2.0

Paper ScienceDirect MLWA - Arend e.a. - "MLPro 2.0 - Online machine learning...

12
Experimental
97 lamastex/ScaDaMaLe

Scalable Data Science and Distributed Machine Learning Course Book written...

12
Experimental
98 rimonim/ds4psych

Data Science for Psychology: Natural Language

12
Experimental
99 GZ30eee/DataVerse

DataVerse is an innovative platform that empowers users with advanced data...

11
Experimental
100 TamerDotWork/datapulse

DataPulse is an automated data clustering service that discovers optimal...

11
Experimental
101 Yogesh-Rebari/AutoCleanX

This project is built with a passion for cleaning raw data (e.g., CSV...

11
Experimental
102 FixML/FixML_Paper

A repository for developing a paper focused on the FixML system.

10
Experimental

Comparisons in this category