raunak-agarwal/instruction-datasets

Datasets for Instruction Tuning of Large Language Models

Score: 28 / 100 (Experimental)

Curated index of 100+ instruction-tuning datasets, spanning gold-standard human-annotated collections (P3, Natural Instructions v2, Open Assistant), LM-generated variants (Self-Instruct, Alpaca, ShareGPT), and preference datasets for reward model training (HH-RLHF, SHP). It also covers multilingual and multimodal instruction data across 46+ languages, plus task-specific domains such as web agents and code generation. Integrates with the Hugging Face datasets hub and serves as a reference for practitioners selecting training corpora across quality tiers.

261 stars. No commits in the last 6 months.

No License, Stale 6m, No Package, No Dependents
Maintenance 0 / 25
Adoption 10 / 25
Maturity 8 / 25
Community 10 / 25


Stars: 261
Forks: 13
Language: (none listed)
License: (none listed)
Last pushed: Nov 30, 2023
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/raunak-agarwal/instruction-datasets"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
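For scripted use, the same request can be sketched in Python with only the standard library. This is a minimal sketch under assumptions: the helper names `build_quality_url` and `fetch_quality` are hypothetical, the category/owner/repo path layout is inferred from the single curl example above, and the response is assumed to be JSON, which this page does not confirm.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL in the same shape as the curl example.

    The category/owner/repo layout is inferred from that one example
    and may not hold for every listing.
    """
    # Percent-encode each path segment so unusual names stay valid.
    parts = [quote(part, safe="") for part in (category, owner, repo)]
    return BASE_URL + "/" + "/".join(parts)


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    """Request the quality record and decode it.

    Assumes a JSON body; the response schema is not documented on this page.
    """
    url = build_quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_quality("llm-tools", "raunak-agarwal", "instruction-datasets")` issues the same GET request as the curl command. Each call counts against the daily request limit, so a script that polls many repositories would likely want to cache results locally.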