llm-datasets and open-llm-datasets

These two repositories are competitors: both provide curated lists of datasets for large language model post-training and research.

llm-datasets
Score: 53 (Established)
Maintenance 16/25, Adoption 10/25, Maturity 8/25, Community 19/25
Stars: 4,319
Forks: 354
Downloads:
Commits (30d): 2
Language:
License:
Flags: No License, No Package, No Dependents

open-llm-datasets
Score: 32 (Emerging)
Maintenance 0/25, Adoption 9/25, Maturity 16/25, Community 7/25
Stars: 101
Forks: 5
Downloads:
Commits (30d): 0
Language:
License: MIT
Flags: Stale 6m, No Package, No Dependents
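
The overall scores listed above (53 and 32) match the sum of each repository's four sub-scores. The page does not state the aggregation rule, so the short sketch below only checks that arithmetic; the plain-sum rule, the implied 100-point scale, and the names `SUBSCORES` and `overall_score` are assumptions made for illustration.

```python
# Sub-scores copied from the scorecards above (each is out of 25).
SUBSCORES = {
    "llm-datasets": {
        "maintenance": 16, "adoption": 10, "maturity": 8, "community": 19,
    },
    "open-llm-datasets": {
        "maintenance": 0, "adoption": 9, "maturity": 16, "community": 7,
    },
}


def overall_score(parts: dict[str, int]) -> int:
    """Assumed aggregation: a plain sum of the four 0-25 sub-scores."""
    return sum(parts.values())


totals = {name: overall_score(parts) for name, parts in SUBSCORES.items()}

for name, total in totals.items():
    # The /100 scale is inferred from 4 x 25, not stated on the page.
    print(f"{name}: {total}/100")
```

Under that assumption, the 21-point gap comes mostly from Maintenance (16 vs. 0) and Community (19 vs. 7), while open-llm-datasets leads only on Maturity (16 vs. 8).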

About llm-datasets

mlabonne/llm-datasets

Curated list of datasets and tools for post-training.

Organizes post-training datasets into specialized categories (instruction, math, code, reasoning). Its quality criteria emphasize accuracy, diversity, and complexity, evaluated through manual review, heuristics, and judge LLM scoring. Covers both human-curated and synthetically generated datasets from academic and industry sources, with a focus on permissive licensing. Relies primarily on the Hugging Face Hub for dataset hosting and targets the SFT/fine-tuning phase of LLM development pipelines.

About open-llm-datasets

dsdanielpark/open-llm-datasets

Repository for organizing datasets and papers used in Open LLM.

Scores updated daily from GitHub, PyPI, and npm data.