Data Quality Preprocessing ML Frameworks
Tools and techniques for assessing, cleaning, and preparing datasets for machine learning. Includes data validation, outlier detection, missing value handling, and dataset quality frameworks. Does NOT include domain-specific cleaning (e.g., text-only or image-only), general data science tutorials without code frameworks, or downstream ML modeling tasks.
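As a rough illustration of two techniques in scope here (missing value handling and outlier detection), the sketch below uses only the Python standard library. It is not taken from any framework listed on this page. The helper names `impute_missing` and `iqr_outliers`, the sample data, and the 1.5 × IQR threshold are assumptions chosen for the example.

```python
from statistics import median, quantiles


def impute_missing(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Hypothetical column with one missing entry and one suspicious reading.
raw = [10.0, 12.0, None, 11.0, 13.0, 95.0, 12.0]
clean = impute_missing(raw)
outliers = iqr_outliers(clean)
print(clean)
print(outliers)
```

Median imputation and the IQR rule are only one simple choice each. The frameworks listed here generally offer richer checks, so treat this as a baseline for comparison rather than a recommendation.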
This directory tracks 102 data quality preprocessing frameworks. Four score above 70 (Verified tier). The highest-rated is biolab/orange3 at 90/100, with 5,573 stars and 33,517 monthly downloads. Four of the top 10 are actively maintained.
Get all 102 projects as JSON
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=ml-frameworks&subcategory=data-quality-preprocessing&limit=20"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
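The same request can be made from Python. The sketch below is a minimal version using only the standard library. The endpoint and query parameters are copied from the curl command above. The helper names `build_url` and `fetch_projects` are hypothetical, and because the response schema is not documented here, the code only assumes the body is valid JSON.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl command on this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(limit=20):
    """Build the query URL; limit=20 mirrors the curl example."""
    params = {
        "domain": "ml-frameworks",
        "subcategory": "data-quality-preprocessing",
        "limit": limit,
    }
    return f"{BASE_URL}?{urlencode(params)}"


def fetch_projects(limit=20, timeout=30):
    """Fetch and decode the JSON response (structure not assumed)."""
    with urlopen(build_url(limit), timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_projects()` performs one unauthenticated request, which counts against the 100 requests/day allowance. Whether a larger `limit` returns all 102 projects in a single call is not stated on this page, so inspect the response before relying on it.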
| # | Framework | Score | Tier |
|---|---|---|---|
| 1 | biolab/orange3: 🍊 :bar_chart: :bulb: Orange: Interactive data analysis | | Verified |
| 2 | skrub-data/skrub: Machine learning with dataframes | | Verified |
| 3 | cleanlab/cleanlab: Cleanlab's open-source library is the standard data-centric AI package for... | | Verified |
| 4 | root-project/root: The official repository for ROOT: analyzing, storing and visualizing big... | | Verified |
| 5 | fbdesignpro/sweetviz: Visualize and compare datasets, target values and associations, with one... | | Established |
| 6 | drivendataorg/deon: A command line tool to easily add an ethics checklist to your data science projects. | | Established |
| 7 | deepnote/deepnote: Deepnote is a drop-in replacement for Jupyter with an AI-first design, sleek... | | Established |
| 8 | JasonKessler/scattertext: Beautiful visualizations of how language differs among document types. | | Established |
| 9 | deepnote/deepnote-toolkit: Essential Python toolkit for Deepnote environments | | Established |
| 10 | rhiever/datacleaner: A Python tool that automatically cleans data sets and readies them for analysis. | | Established |
| 11 | bodo-ai/PyDough: Analytics DSL for Python | | Established |
| 12 | Renumics/spotlight: Interactively explore unstructured datasets from your dataframe. | | Established |
| 13 | ShimantoRahman/empulse: Value-driven and cost-sensitive analysis for scikit-learn | | Established |
| 14 | AutoViML/pandas_dq: Find data quality issues and clean your data in a single line of code with a... | | Established |
| 15 | SERG-Delft/dslinter: `dslinter` is a pylint plugin for linting data science and machine learning... | | Emerging |
| 16 | IRT-SystemX/dqm-ml: A library to compute data quality metrics | | Emerging |
| 17 | Data-Centric-AI-Community/ydata-quality: Data Quality assessment with one line of code | | Emerging |
| 18 | ml-tooling/ml-workspace: 🛠 All-in-one web-based IDE specialized for machine learning and data science. | | Emerging |
| 19 | PAIR-code/facets: Visualizations for machine learning datasets | | Emerging |
| 20 | msamogh/nonechucks: Deal with bad samples in your dataset dynamically, use Transforms as... | | Emerging |
| 21 | MPEDS/mpeds: Machine-learning Protest Event Data System | | Emerging |
| 22 | COM6012/ScalableML: COM6012 Scalable Machine Learning - University of Sheffield. Enjoy our... | | Emerging |
| 23 | scienxlab/redflag: Safety net for machine learning pipelines. Plays nice with sklearn and pandas. | | Emerging |
| 24 | altermarkive/shrubbery: Numerai Experiments | | Emerging |
| 25 | buabaj/xplore: A Python package built for data scientist/analysts, AI/ML engineers for... | | Emerging |
| 26 | Digital-Dermatology/SelfClean: [NeurIPS 2024] 🧼🔎 A holistic self-supervised data cleaning strategy to... | | Emerging |
| 27 | gretl-project/gretl: Official mirror of the actively maintained repo on sourceforge | | Emerging |
| 28 | Olow304/Data-Science-Machine-Learning: The overall objective of this toolkit is to provide and offer a free... | | Emerging |
| 29 | JacksonBurns/astartes: Better Data Splits for Machine Learning | | Emerging |
| 30 | Renumics/sliceguard: A library for detecting problematic data segments in structured and... | | Emerging |
| 31 | matthewfeickert-talks/reproducible-ml-for-scientists-with-pixi-scipy-2025: SciPy 2025 tutorial on "Reproducible Machine Learning Workflows for... | | Emerging |
| 32 | cssr-tools/ML_near_well: Runfiles for an ML near-well model and to reproduce results from the article... | | Emerging |
| 33 | fusion-jena/MLProvLab: Provenance Management for Data Science Notebooks | | Emerging |
| 34 | Safe-DS/Stub-Generator: Automated generation of Safe-DS stubs for Python libraries. | | Emerging |
| 35 | Livingston-k/cleanPyData: cleanPyData is a Python package for data cleaning and preprocessing. It... | | Emerging |
| 36 | PKNU-PR-ML-Lab/orange: Machine learning and data analysis made easy with Orange (Orange3) | | Emerging |
| 37 | pierpaolo28/Data-Visualization: Collection of interactive Jupyter Notebook widgets and graphs. | | Emerging |
| 38 | genular/pandora: PANDORA :computer: | | Emerging |
| 39 | France-Travail/gabarit: Gabarit: kickstart your data science project from scratch | | Emerging |
| 40 | HelikarLab/candis: :ribbon: A data mining suite for gene expression data. | | Emerging |
| 41 | microsoft/Data-Discovery-Toolkit: A data discovery and manipulation toolset for unstructured data | | Emerging |
| 42 | synapticore-io/marimo-flow: Interactive ML notebooks with reactive updates, AI assistance, and MLflow tracking | | Emerging |
| 43 | cdr-book/cdr-book.github.io: Repository for the website of the book (github hosting support) | | Emerging |
| 44 | HazyResearch/meerkat: Explore and understand your training and validation data. | | Emerging |
| 45 | ThomasWong2022/numerai-benchmark: Python Code used in publications, for archival purposes only | | Emerging |
| 46 | khuyentran1401/reproducible-data-science: Tutorials on creating a reproducible and maintainable data science project | | Emerging |
| 47 | seedatnabeel/Data-IQ: Data-IQ: Characterizing subgroups with heterogeneous outcomes in tabular... | | Emerging |
| 48 | pmaji/data-science-toolkit: Collection of stats, modeling, and data science tools in Python and R. | | Emerging |
| 49 | BinaryResearch/centrifuge-toolkit: Tool for visualizing and empirically analyzing information encoded in binary files | | Emerging |
| 50 | dirty-data-science/python: Tutorial material on machine learning with dirty data in Python | | Emerging |
| 51 | CharlesAverill/satyrn: A Notebook alternative that supports branching code and local collaboration. | | Emerging |
| 52 | councilofelders/numereval: A small library to locally calculate the scores on numer.ai tournament's... | | Experimental |
| 53 | seedatnabeel/Data-SUITE: Data-SUITE: Data-centric identification of in-distribution incongruous... | | Experimental |
| 54 | awojinrin/ML-Workflow-for-the-Determination-of-Hole-Cleaning-Conditions: A repo containing Jupyter notebooks where ensemble algorithms are... | | Experimental |
| 55 | fuseml/examples: A collection of machine learning projects serving as sample applications... | | Experimental |
| 56 | gianlucatruda/numerai: Quant. trading with ML on Numerai | | Experimental |
| 57 | ahmedshahriar/PulsePoint-Data-Analytics: EDA, data processing, cleaning and extensive geospatial analysis on a... | | Experimental |
| 58 | sumanthprabhu/DQC-Toolkit: Quality Checks for Training Data in Machine Learning | | Experimental |
| 59 | adipolak/scaling-machine-learning-course: Scaling Machine Learning in Three Week course in a collaboration with... | | Experimental |
| 60 | numerai/signals-example-scripts: The official example scripts for the Numerai Signals Data Science Tournament | | Experimental |
| 61 | sultanul-ovi/GPU-Cluster-Spot-Resource-Dataset-Analysis: Detailed Analysis Traces for AI jobs leveraging spot GPU resources | | Experimental |
| 62 | NatanMish/data_validation: Tutorial for implementing data validation in data science pipelines | | Experimental |
| 63 | giagiannis/data-profiler: Data profiler is an attempt to model the behavior of a given operator for a... | | Experimental |
| 64 | s-kav/ds_tools: Library consisting of additional & helpful functions for data science research stages | | Experimental |
| 65 | Diogolsn10/statistical-analysis: Provide well-documented statistical analysis tools in Python, R, and Stata... | | Experimental |
| 66 | galafis/awesome-data-science-toolkit: 🚀 Comprehensive toolkit for data scientists with Python utilities, ML... | | Experimental |
| 67 | sbettid/GPSClean: An application to correct a GPS trace using machine learning techniques. To... | | Experimental |
| 68 | Vis4Sense/ml-prov-binder: The code for running our Jupyter Lab extension on https://mybinder.org/ | | Experimental |
| 69 | pawlyk/dsml-tools: Set of Data Science and Machine Learning tools | | Experimental |
| 70 | vyshakA/Orange-VoIP-FreePBX-Trunk: 📞 Add Orange home phone service to FreePBX as a VoIP trunk with simple steps... | | Experimental |
| 71 | yuliu625/Yu-Data-Science-Toolkit: A modular data science toolkit for scientific research, featuring... | | Experimental |
| 72 | virbahu/dmaic-toolkit: Lean Six Sigma DMAIC toolkit statistical tests | | Experimental |
| 73 | ELHoussineT/AutoDataCleaner: Simple and automatic data cleaning in one line of code! It performs one-hot... | | Experimental |
| 74 | iterative/example-gto: Get Started GTO Project | | Experimental |
| 75 | KaziAmitHasan/data-inspector: Data Inspector is an open-source Python library that brings 15++ types of... | | Experimental |
| 76 | NimoKwarkye/stats_tool_repo: XploreML is a node-based application built with dearpygui. This application... | | Experimental |
| 77 | sarwarbeing-ai/Scaler: Scaler: Study Materials for Data Science and Machine Learning | | Experimental |
| 78 | LEL-A/GerAlpacaDataCleaned: German Alpaca Dataset (Cleaned + Translated) | | Experimental |
| 79 | sturlese/numerai_signals_pipeline: Downloads data from Yahoo Finance, generates features, trains a model and... | | Experimental |
| 80 | chiphuyen/metaflow-transformers-tutorials: Metaflow tutorials for ODSC West 2021 | | Experimental |
| 81 | berkaygediz/SolidSheets: 📊 A modern spreadsheet editor with ML integration, supporting real-time... | | Experimental |
| 82 | akashmi/ai-data-engineering-ecosystem-guide: A comprehensive reference guide mapping the entire AI, Machine Learning,... | | Experimental |
| 83 | kjd-dktech/ml-data-analysis-pipeline: Exploratory Data Analysis and Modeling (Academic Setting) | | Experimental |
| 84 | darsh276/snowflake-mh9: ❄️ Simplify data management with snowflake-mh9, a tool that streamlines... | | Experimental |
| 85 | NERC-CEH/DSFP-PyExplorer: A Python package for doing exploratory data analysis of collections on the... | | Experimental |
| 86 | AliAmini93/Data-Distribution-Finder: Developed a Windows-based app for analyzing data distributions and... | | Experimental |
| 87 | RezaMoammadi/Book-Data-Science-R: If you're eager to explore data science, data analysis, and machine... | | Experimental |
| 88 | SakuraPuare/AlibabaTrace: Design and implementation of an analysis and visualization system for the Alibaba cluster dataset cluster-trace-v2018 | | Experimental |
| 89 | nguyencongtri/data12: 🚀 Build scalable enterprise applications with a robust architecture that... | | Experimental |
| 90 | jyhuang201900/Orange-Engine: Integrate the Orange Engine with ease using our free library and sample... | | Experimental |
| 91 | garimamittal13/SMAI-M25: Data analysis, statistical modeling, clustering, forecasting, and deep... | | Experimental |
| 92 | LTxYan/Data-Reliability-Noisy-Input-Handling-in-ML-Models: 🔍 Analyze how noisy and incomplete data impacts machine learning model... | | Experimental |
| 93 | PoojaSiv0211/DataGenome: Interactive dataset structure visualizer using correlation analysis,... | | Experimental |
| 94 | sultanul-ovi/Alibaba-GPU-Cluster-Dataset-2025-Analysis: Detailed Analysis Traces for GPU-Disaggregated Deep Learning Recommendation Models | | Experimental |
| 95 | mentoratechnologies/PurifyFactory-Beta: PurifyFactory v9.1.6: Beta Tester Program | | Experimental |
| 96 | fhswf/paper-mlwa-mlpro-2.0: Paper ScienceDirect MLWA - Arend e.a. - "MLPro 2.0 - Online machine learning... | | Experimental |
| 97 | lamastex/ScaDaMaLe: Scalable Data Science and Distributed Machine Learning Course Book written... | | Experimental |
| 98 | rimonim/ds4psych: Data Science for Psychology: Natural Language | | Experimental |
| 99 | GZ30eee/DataVerse: DataVerse is an innovative platform that empowers users with advanced data... | | Experimental |
| 100 | TamerDotWork/datapulse: DataPulse is an automated data clustering service that discovers optimal... | | Experimental |
| 101 | Yogesh-Rebari/AutoCleanX: This project is built with a passion for cleaning the raw data (e.g., CSV... | | Experimental |
| 102 | FixML/FixML_Paper: A repository for developing a paper focused on the FixML system. | | Experimental |