llm-datasets and open-llm-datasets
These two repositories are competitors: both provide curated lists of datasets for large language model (LLM) post-training and research.
About llm-datasets
mlabonne/llm-datasets
Curated list of datasets and tools for post-training.
Organizes post-training datasets into specialized categories (instruction, math, code, reasoning) and applies quality criteria that emphasize accuracy, diversity, and complexity, evaluated through manual review, heuristics, and judge-LLM scoring. Covers both human-curated and synthetically generated datasets from academic and industry sources, with a focus on permissive licensing. Integrates primarily with the Hugging Face Hub for dataset hosting and targets the supervised fine-tuning (SFT) phase of LLM development pipelines.
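To make the "heuristics" part of those quality criteria concrete, here is a minimal sketch of two simple filters: a minimum response length as a rough proxy for complexity, and duplicate-instruction removal as a rough proxy for diversity. It is not taken from the llm-datasets repository; the `Sample` type, the `filter_samples` function, and the five-word threshold are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    # Hypothetical record shape for one instruction/response pair.
    instruction: str
    response: str


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivially different prompts compare equal.
    return " ".join(text.lower().split())


def filter_samples(samples, min_response_words=5):
    """Keep samples that pass two illustrative heuristics:
    a minimum response length and duplicate-instruction removal."""
    seen = set()
    kept = []
    for s in samples:
        if len(s.response.split()) < min_response_words:
            continue  # response too short (assumed threshold, not from the source)
        key = normalize(s.instruction)
        if key in seen:
            continue  # repeated instruction adds no diversity
        seen.add(key)
        kept.append(s)
    return kept


samples = [
    Sample("Explain recursion.", "Recursion is when a function calls itself on a smaller input."),
    Sample("explain   recursion.", "A function that calls itself until it reaches a base case."),
    Sample("What is 2 + 2?", "4"),
]
kept = filter_samples(samples)
```

With these inputs, the second sample is dropped as a duplicate instruction and the third as too short, leaving only the first. Real curation pipelines would likely combine such checks with manual review and judge-LLM scoring, as the description above notes.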
About open-llm-datasets
dsdanielpark/open-llm-datasets
Repository for organizing datasets and papers used in Open LLM.
Scores updated daily from GitHub, PyPI, and npm data.