LLM Domain Datasets Transformer Models

16 LLM domain dataset projects are tracked. One scores above 50 (Established tier). The highest-rated is mlabonne/llm-datasets at 53/100 with 4,319 stars. One of the top 10 is actively maintained.

Get all 16 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=transformers&subcategory=llm-domain-datasets&limit=20"

Open to everyone: 100 requests/day, no key needed. Get a free key for 1,000 requests/day.
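
The same request can be made from Python. The following is a minimal sketch using only the standard library; the helper names `build_url` and `fetch_projects` are illustrative, not part of the API. The response schema is not documented on this page, so the sketch decodes the JSON without assuming any field names.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl command above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="transformers", subcategory="llm-domain-datasets", limit=20):
    """Build the query URL with the same parameters as the curl example."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_projects(url, timeout=10):
    """Send the GET request and decode the JSON body.

    The structure of the returned object is an unknown here, so it is
    returned as-is rather than unpacked into specific fields.
    """
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_projects(build_url())` performs the request without a key, so it counts against the 100 requests/day limit.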

#	Model	Description	Score	Tier	Stars	Language
1	mlabonne/llm-datasets	Curated list of datasets and tools for post-training.	53	Established	4,319	-
2	malteos/llm-datasets	A collection of datasets for language model pretraining including scripts...	48	Emerging	64	Python
3	magpie-align/magpie	[ICLR 2025] Alignment Data Synthesis from Scratch by Prompting Aligned LLMs...	43	Emerging	834	Python
4	willxxy/ECG-Bench	A Unified Framework for Benchmarking Generative Electrocardiogram-Language...	41	Emerging	42	Python
5	geobrain-ai/geogalactica	Code and datasets for paper "GeoGalactica: A Scientific Large Language Model...	40	Emerging	40	Python
6	HaoAreYuDong/MachineLearningLM	Scaling In-context Learning from Few-shot to 1,024-shot on Tabular ML	34	Emerging	59	Python
7	dsdanielpark/open-llm-datasets	Repository for organizing datasets and papers used in Open LLM.	32	Emerging	101	-
8	asimsinan/LLM-Research	A collection of LLM related papers, thesis, tools, datasets, courses, open...	30	Emerging	62	Python
9	seedatnabeel/CLLM	Curated LLM (ICML 2024)	29	Experimental	14	Jupyter Notebook
10	shahriargolchin/time-travel-in-llms	The official repository for the paper entitled "Time Travel in LLMs: Tracing...	29	Experimental	12	Python
11	artpli/CodeIE	[ACL 23] CodeIE: Large Code Generation Models are Better Few-Shot...	24	Experimental	40	Python
12	sodascience/social_science_inferences_with_llms	Addressing LLM-related measurement error in social science modeling research.	23	Experimental	10	-
13	OSU-NLP-Group/LLM-IOAA	Code and data for the paper "Large Language Models Achieve Gold Medal...	22	Experimental	17	TeX
14	mahadi-nahid/TabSQLify	[NAACL 2024] TabSQLify: Enhancing Reasoning Capabilities of LLMs Through...	22	Experimental	17	Python
15	rmovva/LLM-publication-patterns-public	[NAACL 2024] Topics, Authors, and Institutions in Large Language Model...	15	Experimental	17	Jupyter Notebook
16	vicgalle/distilled-self-critique	distilled Self-Critique refines the outputs of a LLM with only synthetic data	14	Experimental	11	Jupyter Notebook
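
The tier labels in the table appear to follow score cutoffs. The sketch below reproduces them under assumed thresholds inferred only from the rows above: above 50 is Established, 30 to 50 is Emerging, below 30 is Experimental. These cutoffs are not stated by the site, which may also define tiers that do not appear in this category; `tier_for` is a hypothetical helper.

```python
from collections import Counter

# Scores copied from the table above, in rank order.
SCORES = [53, 48, 43, 41, 40, 34, 32, 30, 29, 29, 24, 23, 22, 22, 15, 14]


def tier_for(score):
    """Map a 0-100 score to a tier label.

    The cutoffs are assumptions inferred from the table, not documented values.
    """
    if score > 50:
        return "Established"
    if score >= 30:
        return "Emerging"
    return "Experimental"


# Count how many of the 16 projects fall into each tier.
tier_counts = Counter(tier_for(score) for score in SCORES)
```

With these assumed cutoffs, `tier_counts` matches the table: 1 Established, 7 Emerging, and 8 Experimental projects.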

Comparisons in this category

mlabonne/llm-datasets vs. dsdanielpark/open-llm-datasets (53 vs 32)