mlabonne/llm-datasets
Curated list of datasets and tools for post-training.
Organizes post-training datasets into specialized categories (instruction, math, code, reasoning) and sets quality criteria that emphasize accuracy, diversity, and complexity, evaluated through manual review, heuristics, and judge LLM scoring. Covers both human-curated and synthetically generated datasets from academic and industry sources, with a focus on permissive licensing. Integrates primarily with the Hugging Face Hub for dataset hosting and targets the SFT/fine-tuning phase of LLM development pipelines.
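The heuristic part of that evaluation can be illustrated with a minimal sketch. This is not the repository's own tooling: the `Sample` type, the `heuristic_filter` and `distinct_word_ratio` helpers, and the `min_response_words` threshold are all hypothetical names chosen for illustration. It drops samples with empty or duplicate instructions and very short responses, then reports a crude diversity proxy.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One instruction/response pair (hypothetical schema for illustration)."""
    instruction: str
    response: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical prompts compare equal."""
    return " ".join(text.lower().split())


def heuristic_filter(samples, min_response_words=3):
    """Keep samples with a non-empty, previously unseen instruction and a
    response of at least `min_response_words` words (an assumed threshold)."""
    seen = set()
    kept = []
    for sample in samples:
        key = normalize(sample.instruction)
        if not key or key in seen:
            continue  # empty or duplicate instruction
        if len(sample.response.split()) < min_response_words:
            continue  # response too short to be useful
        seen.add(key)
        kept.append(sample)
    return kept


def distinct_word_ratio(samples):
    """Crude diversity proxy: unique instruction words / total instruction words."""
    words = [w for s in samples for w in normalize(s.instruction).split()]
    return len(set(words)) / len(words) if words else 0.0
```

Checks like these are cheap to run first; manual review and judge LLM scoring would then apply to whatever survives.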
4,319 stars. Actively maintained with 2 commits in the last 30 days.
Stars: 4,319
Forks: 354
Language: not specified
License: not specified
Category: not specified
Last pushed: Mar 09, 2026
Commits (30d): 2
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/mlabonne/llm-datasets"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
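The same request can be made from Python with only the standard library. This is a sketch under assumptions: the URL pattern is inferred from the single curl example above, the `build_url` and `fetch_quality` helper names are hypothetical, and the response format is not documented here, so the body is returned as raw text rather than parsed.

```python
import urllib.parse
import urllib.request

# Base path taken from the curl example; everything after it is assumed to be
# "<category>/<owner>/<repo>" based on that one URL.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_url(category: str, owner: str, repo: str) -> str:
    """Build the request URL, percent-encoding each path segment."""
    parts = [urllib.parse.quote(p, safe="") for p in (category, owner, repo)]
    return BASE_URL + "/" + "/".join(parts)


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0) -> str:
    """Perform the GET request and return the response body as text."""
    url = build_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Example call (performs a network request, so it is left commented out):
# print(fetch_quality("transformers", "mlabonne", "llm-datasets"))
```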
Related models
malteos/llm-datasets
A collection of datasets for language model pretraining including scripts for downloading,...
magpie-align/magpie
[ICLR 2025] Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. Your...
willxxy/ECG-Bench
A Unified Framework for Benchmarking Generative Electrocardiogram-Language Models (ELMs)
geobrain-ai/geogalactica
Code and datasets for paper "GeoGalactica: A Scientific Large Language Model in Geoscience"
seedatnabeel/CLLM
Curated LLM (ICML 2024)