raunak-agarwal/instruction-datasets
Datasets for Instruction Tuning of Large Language Models
Curated index of 100+ instruction-tuning datasets spanning gold-standard human-annotated collections (P3, Natural Instructions v2, Open Assistant), LM-generated variants (Self-Instruct, Alpaca, ShareGPT), and preference datasets for reward model training (HH-RLHF, SHP). Covers multilingual and multimodal instruction data across 46+ languages and task-specific domains including web agents and code generation. Integrates with Hugging Face datasets hub and serves as a comprehensive reference for practitioners selecting training corpora across quality tiers.
261 stars. No commits in the last 6 months.
Stars: 261
Forks: 13
Language: not specified
License: not specified
Category: (not shown)
Last pushed: Nov 30, 2023
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/raunak-agarwal/instruction-datasets"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
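The same request can be sketched in Python using only the standard library. This is a minimal sketch, not the service's documented client: the helper names `build_quality_url` and `fetch_quality` are hypothetical, and it assumes the endpoint returns JSON, which the page does not state. How an API key is supplied is not described here either, so the sketch only covers the keyless tier.

```python
import json
import urllib.request

# Base path taken from the curl command above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality/llm-tools"


def build_quality_url(owner: str, repo: str) -> str:
    """Return the endpoint URL for one repository (hypothetical helper)."""
    return f"{BASE_URL}/{owner}/{repo}"


def fetch_quality(owner: str, repo: str, timeout: float = 10.0):
    """Fetch the record for a repository.

    Assumption: the response body is JSON. If it is not, json.loads
    raises a ValueError and the raw body should be inspected instead.
    """
    url = build_quality_url(owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `fetch_quality("raunak-agarwal", "instruction-datasets")` issues the same GET request as the curl command; with the keyless limit of 100 requests/day, it is worth caching the result rather than calling it in a loop.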
Higher-rated alternatives
MantisAI/sieves
Plug-and-play document AI with zero-shot models.
xiaoya-li/Instruction-Tuning-Survey
Project for the paper entitled `Instruction Tuning for Large Language Models: A Survey`
TencentARC-QQ/TagGPT
TagGPT: Large Language Models are Zero-shot Multimodal Taggers
LIN-SHANG/InstructERC
The official realization of InstructERC
Lichang-Chen/InstructZero
Official Implementation of InstructZero; the first framework to optimize bad prompts of...