Open Dataset Collections ML Frameworks

Curated repositories and directories that aggregate, catalog, or provide access to multiple datasets across various domains. Does NOT include individual datasets, dataset generation tools, or domain-specific dataset papers.

60 open dataset collection frameworks are tracked. Two score above 70 (Verified tier). The highest-rated is open-edge-platform/datumaro at 86/100, with 661 stars and 24,334 monthly downloads. Two of the top 10 are actively maintained.

Get the projects as JSON (the example request sets limit=20)

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=ml-frameworks&subcategory=open-dataset-collections&limit=20"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
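The same request can be made from Python. The sketch below only rebuilds the URL from the curl example and fetches it with the standard library; the helper names `build_quality_url` and `fetch_quality` are hypothetical, and because the shape of the JSON response is not documented here, the result is returned as-is without assuming any field names.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_quality_url(domain="ml-frameworks",
                      subcategory="open-dataset-collections",
                      limit=20):
    """Build the request URL with the same query parameters as the curl example."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_quality(url, timeout=10):
    """Fetch the URL and decode the body as JSON (response schema is not assumed)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_quality(build_quality_url())` would issue one request against the keyless daily quota mentioned above.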

| # | Framework | Description | Score | Tier |
|---|-----------|-------------|-------|------|
| 1 | open-edge-platform/datumaro | Dataset Management Framework, a Python library and a CLI tool to build,... | 86 | Verified |
| 2 | webdataset/webdataset | A high-performance Python-based I/O system for large (and small) deep... | 79 | Verified |
| 3 | tensorflow/datasets | TFDS is a collection of datasets ready to use with TensorFlow, Jax, ... | 67 | Established |
| 4 | explosion/ml-datasets | 🌊 Machine learning dataset loaders for testing and example scripts | 67 | Established |
| 5 | alan-turing-institute/CleverCSV | CleverCSV is a Python package for handling messy CSV files. It provides a... | 64 | Established |
| 6 | JovianHQ/opendatasets | A Python library for downloading datasets from Kaggle, Google Drive, and... | 54 | Established |
| 7 | mlcommons/croissant | Croissant is a high-level format for machine learning datasets that brings... | 49 | Emerging |
| 8 | opengeos/aws-open-data | A list of open datasets on AWS | 45 | Emerging |
| 9 | benedekrozemberczki/datasets | A repository of pretty cool datasets that I collected for network science... | 44 | Emerging |
| 10 | Pinak-Datta/wiz-craft | A CLI-based dataset preprocessing tool for machine learning tasks. Features... | 43 | Emerging |
| 11 | src-d/datasets | source{d} datasets ("big code") for source code analysis and machine... | 42 | Emerging |
| 12 | foorilla/ai-jobs-net-salaries | A dataset of global salaries in AI/ML and Big Data. | 41 | Emerging |
| 13 | packing-box/python-dsff | DataSet File Format (DSFF) | 40 | Emerging |
| 14 | unsplash/datasets | 🎁 6,500,000+ Unsplash images made available for research and machine learning | 38 | Emerging |
| 15 | jbrownlee/Datasets | Machine learning datasets used in tutorials on MachineLearningMastery.com | 36 | Emerging |
| 16 | cleanlab/label-errors | 🛠️ Corrected Test Sets for ImageNet, MNIST, CIFAR, Caltech-256, QuickDraw,... | 35 | Emerging |
| 17 | BaumSebastian/DDACS | Python interface for the DDACS dataset: 32K+ deep drawing simulations with... | 35 | Emerging |
| 18 | CYang828/datasetstation | Quickly download Chinese datasets, process datasets, and run data analysis and visual analysis: a one-stop solution for data problems | 35 | Emerging |
| 19 | osdg-ai/osdg-data | The OSDG Community Dataset (OSDG-CD) is a public dataset of thousands of... | 33 | Emerging |
| 20 | samz5320/Data4ALL | A spot for all the datasets you need. | 31 | Emerging |
| 21 | SigmaJahan/Textual-Dissimilarity-Analysis-for-Duplicate-Bug-Report-Detection | We conduct a large-scale empirical study to understand better the impacts of... | 30 | Emerging |
| 22 | Vatshayan/Data-sets | Different Data-set on various Important topic on Real-world Problems | 30 | Emerging |
| 23 | anwielts/datasheet-for-dataset | Automatically create standardized documentation for the dataset used in your... | 29 | Experimental |
| 24 | Intelligent-CAT-Lab/SEER | Artifact repository for the paper "Perfect Is the Enemy of Test Oracle", In... | 29 | Experimental |
| 25 | yongfanbeta/Open-Access-Medical-Data | A list of Open-Access-Medical-Data(OAMD) commonly used in medical research | 29 | Experimental |
| 26 | seart-group/DL4SE | Building Training Datasets for Deep Learning Models in Software Engineering... | 29 | Experimental |
| 27 | shreyashankar/datasets-for-good | List of datasets to apply stats/machine learning/technology to the world of... | 29 | Experimental |
| 28 | fossology/Minerva-Dataset-Generation | Validated dataset generation using regex along with NLP Algorithms. | 28 | Experimental |
| 29 | AdaptInfer/CompBioDatasetsForMachineLearning | A Curated List of Computational Biology Datasets Suitable for Machine Learning | 28 | Experimental |
| 30 | simula/datasets.simula.no | Public datasets published by Simula. | 28 | Experimental |
| 31 | lennox55555/Savvy-CSV | Savvy CSV is a web application designed to effortlessly create the ideal... | 27 | Experimental |
| 32 | modelset/modelset-dataset | ModelSet is a labelled dataset of Ecore and UML models | 26 | Experimental |
| 33 | DagsHub/3D-model-datasets | Open-source 3D Model datasets | 24 | Experimental |
| 34 | incubrain/awesome-maharashtra-data | A collection of datasets specific to Maharashtra, India. WIP | 23 | Experimental |
| 35 | asampat3090/open-datasets | Running list of Open Datasets | 23 | Experimental |
| 36 | DagsHub/open-source-ml-datasets | This repository holds open source datasets for various machine learning... | 23 | Experimental |
| 37 | mdrmdmau/datasets | 📊 Gather and share open datasets for the Indonesian physics community,... | 23 | Experimental |
| 38 | ZamAI-ORG/mt5-pashto | Pashto-focused work with mT5 (experiments, fine-tuning, references) in ZamAI Labs. | 22 | Experimental |
| 39 | ZamAI-ORG/pashto-datasets | Curated and processed Pashto datasets for ZamAI Labs (with source... | 22 | Experimental |
| 40 | salesforce/iSEA | Official code repository for "iSEA: An Interactive Pipeline of Semantic... | 22 | Experimental |
| 41 | ZamAI-ORG/training-spaces | Reusable training and experiment spaces for ZamAI Labs (templates, scripts,... | 22 | Experimental |
| 42 | samuelmcnair33/Samuel-McNair-Dataset | Personal dataset released under CC0 license | 22 | Experimental |
| 43 | AhmedBella/World-Dataset-Library | A Django/React website for sharing datasets - It seems that we got beaten to... | 22 | Experimental |
| 44 | MainakVerse/Datasets | List of ready to use datasets for your projects | 20 | Experimental |
| 45 | skforecast/skforecast-datasets | This repository contains datasets used in the skforecast library. It also... | 20 | Experimental |
| 46 | autonlab/aqua | AQuA: A Benchmarking Tool for Label Quality Assessment, NeurIPS'23 D&B | 19 | Experimental |
| 47 | QQQHY/Medical-Datasets-for-Machine-Learning | Medical Datasets for Machine Learning | 18 | Experimental |
| 48 | VLa-Labs/Swedish-Language-Dataset-List | 🇸🇪 A curated collection of 38 public Swedish language datasets metadata... | 18 | Experimental |
| 49 | ZamAI-ORG/zamai-models | Model artifacts and experiments published by ZamAI Labs (training results... | 14 | Experimental |
| 50 | ZamAI-ORG/labs | ZamAI Labs: datasets, Pashto processing, models, and training pipelines... | 14 | Experimental |
| 51 | mlnjsh/EIF-Training-Datasets | 📁 Curated Training Datasets: Clean, labeled datasets for ML/AI coursework... | 14 | Experimental |
| 52 | aoerecinfo/aoe2dataset | Age of Empires II Definitive Edition rec analysis dataset | 13 | Experimental |
| 53 | lucien1011/MoonBoard-Route | Dataset for Moonboard routes with 2016 and 2017 setting, scraped in 2018. | 12 | Experimental |
| 54 | lkkhwhb/TrainingData | This repository stores and organizes training data folders for machine... | 11 | Experimental |
| 55 | coneco-lab/open-lab-toolkit | A collection of CoN&Co Lab's main software tools | 11 | Experimental |
| 56 | Knodl-LLC/KnoDL-Match | Service for automatic matching two data sets without mapping | 10 | Experimental |
| 57 | JeremGamingYT/TrainAIDatasets | This is an AI dataset project with over 10,000-100,000 pieces of data! | 10 | Experimental |
| 58 | komal11lamba/dataset_komal | my dataset collection | 10 | Experimental |
| 59 | vivesweb/csv_pair_file | Manage csv pair files for Machine Learning | 10 | Experimental |
| 60 | serval-uni-lu/The_dataset_of_large_case_studies_on_mutants_similarity_with_bugs | The dataset of large case studies on mutants similarity, measured both... | 10 | Experimental |
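The relationship between score and tier in the listing can be sketched as a simple threshold lookup. Only the Verified boundary (scores above 70) is stated on this page; the cutoffs of 50 and 30 used below are assumptions inferred from the scores shown (Established spans 54 to 67, Emerging 30 to 49, Experimental 10 to 29), and the handling of a score of exactly 70 is also a guess. The function name `tier_for` is hypothetical.

```python
# Cutoffs checked from highest to lowest. Only the 70 boundary comes from the
# page text; 50 and 30 are ASSUMED from the scores visible in the listing.
TIER_CUTOFFS = [
    (70, "Verified"),
    (50, "Established"),
    (30, "Emerging"),
]


def tier_for(score):
    """Return the tier label for a 0-100 score under the assumed cutoffs."""
    for cutoff, name in TIER_CUTOFFS:
        if score >= cutoff:
            return name
    return "Experimental"


# A few (score, tier) pairs copied from the listing, used as a consistency check.
SAMPLE_ROWS = [(86, "Verified"), (67, "Established"), (49, "Emerging"),
               (30, "Emerging"), (29, "Experimental"), (10, "Experimental")]
MISMATCHES = [(s, t) for s, t in SAMPLE_ROWS if tier_for(s) != t]
```

Under these assumed cutoffs the sampled rows all agree with the listed tiers, so `MISMATCHES` is empty; that shows the cutoffs are consistent with the page, not that they are the ones the scoring service actually uses.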