llm-datasets and open-llm-datasets

These two repositories compete directly: both provide curated lists of datasets for large language model post-training and research.

                 llm-datasets (Established)    open-llm-datasets (Emerging)
Maintenance      16/25                         0/25
Adoption         10/25                         9/25
Maturity         8/25                          16/25
Community        19/25                         7/25
Stars            4,319                         101
Forks            354                           5
Downloads        —                             —
Commits (30d)    2                             0
Language         —                             —
License          —                             MIT

llm-datasets: No License, No Package, No Dependents

open-llm-datasets: Stale 6m, No Package, No Dependents

About llm-datasets

mlabonne/llm-datasets

Curated list of datasets and tools for post-training.

Organizes post-training datasets across specialized categories (instruction, math, code, reasoning) with quality criteria emphasizing accuracy, diversity, and complexity—evaluated through manual review, heuristics, and judge LLM scoring. Covers both human-curated and synthetically-generated datasets from academic and industry sources, with focus on permissive licensing. Integrates primarily with Hugging Face Hub for dataset hosting and targets the SFT/fine-tuning phase of LLM development pipelines.
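As a rough illustration of the "heuristics" step mentioned above, the sketch below applies two simple checks to instruction-style samples: a word-count bound and duplicate removal after case and whitespace normalization. This is not the repository's own tooling. The function name `heuristic_filter`, the `instruction` and `response` field names, and the thresholds are all assumptions chosen for the example.

```python
def heuristic_filter(samples, min_words=5, max_words=2000):
    """Keep samples that pass a length check and a normalized-duplicate check.

    `samples` is a list of dicts with hypothetical "instruction" and
    "response" keys; the thresholds are illustrative, not recommended values.
    """
    seen = set()
    kept = []
    for sample in samples:
        text = f'{sample["instruction"]} {sample["response"]}'.strip()

        # Length heuristic: drop trivially short or excessively long samples.
        n_words = len(text.split())
        if not (min_words <= n_words <= max_words):
            continue

        # Duplicate heuristic: compare on lowercased, whitespace-collapsed text.
        key = " ".join(text.lower().split())
        if key in seen:
            continue

        seen.add(key)
        kept.append(sample)
    return kept
```

Checks like these only catch surface-level problems; accuracy and complexity would still need the manual review or judge-LLM scoring that the description mentions.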

About open-llm-datasets

dsdanielpark/open-llm-datasets

Repository for organizing datasets and papers used in Open LLM.

Scores updated daily from GitHub, PyPI, and npm data.