Open Dataset Collections ML Frameworks
Curated repositories and directories that aggregate, catalog, or provide access to multiple datasets across various domains. Does NOT include individual datasets, dataset generation tools, or domain-specific dataset papers.
60 open dataset collection frameworks are tracked. Two score above 70 (Verified tier). The highest-rated is open-edge-platform/datumaro at 86/100, with 661 stars and 24,334 monthly downloads. Two of the top 10 are actively maintained.
Get the projects as JSON (the example request sets limit=20):
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=ml-frameworks&subcategory=open-dataset-collections&limit=20"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
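The same request can be made from Python. The sketch below is a minimal, unofficial example: the endpoint and the `domain`, `subcategory`, and `limit` query parameters are taken from the curl command above, while the helper names `build_url` and `fetch_projects` are hypothetical. The response schema is not described on this page, so the code simply returns whatever JSON the API sends back.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint copied from the curl example on this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="ml-frameworks",
              subcategory="open-dataset-collections",
              limit=20):
    """Build the request URL (hypothetical helper, not part of any official client)."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_projects(url, timeout=10):
    """Fetch the URL and return the parsed JSON body.

    The structure of the response is an unknown here, so no fields are assumed.
    """
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Example (performs a network request and counts against the daily limit):
#   data = fetch_projects(build_url())
#   print(json.dumps(data, indent=2))
```

Keeping URL construction separate from the network call makes it easy to change the query parameters and inspect the resulting URL before spending one of the daily requests.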
| # | Framework | Score | Tier |
|---|---|---|---|
| 1 | open-edge-platform/datumaro: Dataset Management Framework, a Python library and a CLI tool to build,... | | Verified |
| 2 | webdataset/webdataset: A high-performance Python-based I/O system for large (and small) deep... | | Verified |
| 3 | tensorflow/datasets: TFDS is a collection of datasets ready to use with TensorFlow, Jax, ... | | Established |
| 4 | explosion/ml-datasets: 🌊 Machine learning dataset loaders for testing and example scripts | | Established |
| 5 | alan-turing-institute/CleverCSV: CleverCSV is a Python package for handling messy CSV files. It provides a... | | Established |
| 6 | JovianHQ/opendatasets: A Python library for downloading datasets from Kaggle, Google Drive, and... | | Established |
| 7 | mlcommons/croissant: Croissant is a high-level format for machine learning datasets that brings... | | Emerging |
| 8 | opengeos/aws-open-data: A list of open datasets on AWS | | Emerging |
| 9 | benedekrozemberczki/datasets: A repository of pretty cool datasets that I collected for network science... | | Emerging |
| 10 | Pinak-Datta/wiz-craft: A CLI-based dataset preprocessing tool for machine learning tasks. Features... | | Emerging |
| 11 | src-d/datasets: source{d} datasets ("big code") for source code analysis and machine... | | Emerging |
| 12 | foorilla/ai-jobs-net-salaries: A dataset of global salaries in AI/ML and Big Data. | | Emerging |
| 13 | packing-box/python-dsff: DataSet File Format (DSFF) | | Emerging |
| 14 | unsplash/datasets: 🎁 6,500,000+ Unsplash images made available for research and machine learning | | Emerging |
| 15 | jbrownlee/Datasets: Machine learning datasets used in tutorials on MachineLearningMastery.com | | Emerging |
| 16 | cleanlab/label-errors: 🛠️ Corrected Test Sets for ImageNet, MNIST, CIFAR, Caltech-256, QuickDraw,... | | Emerging |
| 17 | BaumSebastian/DDACS: Python interface for the DDACS dataset: 32K+ deep drawing simulations with... | | Emerging |
| 18 | CYang828/datasetstation: Quickly download Chinese datasets, process datasets, and run data analysis and visual analysis; a one-stop solution for data problems | | Emerging |
| 19 | osdg-ai/osdg-data: The OSDG Community Dataset (OSDG-CD) is a public dataset of thousands of... | | Emerging |
| 20 | samz5320/Data4ALL: A spot for all the datasets you need. | | Emerging |
| 21 | SigmaJahan/Textual-Dissimilarity-Analysis-for-Duplicate-Bug-Report-Detection: We conduct a large-scale empirical study to understand better the impacts of... | | Emerging |
| 22 | Vatshayan/Data-sets: Different Data-set on various Important topic on Real-world Problems | | Emerging |
| 23 | anwielts/datasheet-for-dataset: Automatically create standardized documentation for the dataset used in your... | | Experimental |
| 24 | Intelligent-CAT-Lab/SEER: Artifact repository for the paper "Perfect Is the Enemy of Test Oracle", In... | | Experimental |
| 25 | yongfanbeta/Open-Access-Medical-Data: A list of Open-Access-Medical-Data(OAMD) commonly used in medical research | | Experimental |
| 26 | seart-group/DL4SE: Building Training Datasets for Deep Learning Models in Software Engineering... | | Experimental |
| 27 | shreyashankar/datasets-for-good: List of datasets to apply stats/machine learning/technology to the world of... | | Experimental |
| 28 | fossology/Minerva-Dataset-Generation: Validated dataset generation using regex along with NLP Algorithms. | | Experimental |
| 29 | AdaptInfer/CompBioDatasetsForMachineLearning: A Curated List of Computational Biology Datasets Suitable for Machine Learning | | Experimental |
| 30 | simula/datasets.simula.no: Public datasets published by Simula. | | Experimental |
| 31 | lennox55555/Savvy-CSV: Savvy CSV is an web application designed to effortlessly create the ideal... | | Experimental |
| 32 | modelset/modelset-dataset: ModelSet is a labelled dataset of Ecore and UML models | | Experimental |
| 33 | DagsHub/3D-model-datasets: Open-source 3D Model datasets | | Experimental |
| 34 | incubrain/awesome-maharashtra-data: A collection of datasets specific to Maharashtra, India. WIP | | Experimental |
| 35 | asampat3090/open-datasets: Running list of Open Datasets | | Experimental |
| 36 | DagsHub/open-source-ml-datasets: This repository holds open source datasets for various machine learning... | | Experimental |
| 37 | mdrmdmau/datasets: 📊 Gather and share open datasets for the Indonesian physics community,... | | Experimental |
| 38 | ZamAI-ORG/mt5-pashto: Pashto-focused work with mT5 (experiments, fine-tuning, references) in ZamAI Labs. | | Experimental |
| 39 | ZamAI-ORG/pashto-datasets: Curated and processed Pashto datasets for ZamAI Labs (with source... | | Experimental |
| 40 | salesforce/iSEA: Official code repository for "iSEA: An Interactive Pipeline of Semantic... | | Experimental |
| 41 | ZamAI-ORG/training-spaces: Reusable training and experiment spaces for ZamAI Labs (templates, scripts,... | | Experimental |
| 42 | samuelmcnair33/Samuel-McNair-Dataset: Personal dataset released under CC0 license | | Experimental |
| 43 | AhmedBella/World-Dataset-Library: A Django/React website for sharing datasets - It seems that we got beaten to... | | Experimental |
| 44 | MainakVerse/Datasets: List of ready to use datasets for your projects | | Experimental |
| 45 | skforecast/skforecast-datasets: This repository contains datasets used in the skforecast library. It also... | | Experimental |
| 46 | autonlab/aqua: AQuA: A Benchmarking Tool for Label Quality Assessment, NeurIPS'23 D&B | | Experimental |
| 47 | QQQHY/Medical-Datasets-for-Machine-Learning: Medical Datasets for Machine Learning (medical data for machine learning) | | Experimental |
| 48 | VLa-Labs/Swedish-Language-Dataset-List: 🇸🇪 A curated collection of 38 public Swedish language datasets metadata... | | Experimental |
| 49 | ZamAI-ORG/zamai-models: Model artifacts and experiments published by ZamAI Labs (training results... | | Experimental |
| 50 | ZamAI-ORG/labs: ZamAI Labs: datasets, Pashto processing, models, and training pipelines... | | Experimental |
| 51 | mlnjsh/EIF-Training-Datasets: 📁 Curated Training Datasets: Clean, labeled datasets for ML/AI coursework... | | Experimental |
| 52 | aoerecinfo/aoe2dataset: Age of Empires II Definitive Edition rec analysis dataset | | Experimental |
| 53 | lucien1011/MoonBoard-Route: Dataset for Moonboard routes with 2016 and 2017 setting, scraped in 2018. | | Experimental |
| 54 | lkkhwhb/TrainingData: This repository stores and organizes training data folders for machine... | | Experimental |
| 55 | coneco-lab/open-lab-toolkit: A collection of CoN&Co Lab's main software tools | | Experimental |
| 56 | Knodl-LLC/KnoDL-Match: Service for automatic matching two data sets without mapping | | Experimental |
| 57 | JeremGamingYT/TrainAIDatasets: This is an AI dataset project with over 10,000-100,000 pieces of data! | | Experimental |
| 58 | komal11lamba/dataset_komal: my dataset collection | | Experimental |
| 59 | vivesweb/csv_pair_file: Manage csv pair files for Machine Learning | | Experimental |
| 60 | serval-uni-lu/The_dataset_of_large_case_studies_on_mutants_similarity_with_bugs: The dataset of large case studies on mutants similarity, measured both... | | Experimental |