raunak-agarwal/instruction-datasets

Datasets for Instruction Tuning of Large Language Models

28 / 100

Experimental

Curated index of 100+ instruction-tuning datasets spanning gold-standard human-annotated collections (P3, Natural Instructions v2, Open Assistant), LM-generated variants (Self-Instruct, Alpaca, ShareGPT), and preference datasets for reward model training (HH-RLHF, SHP). Covers multilingual and multimodal instruction data across 46+ languages, plus task-specific domains including web agents and code generation. Integrates with the Hugging Face datasets hub and serves as a reference for practitioners selecting training corpora across quality tiers.

261 stars. No commits in the last 6 months.

No License · Stale 6m · No Package · No Dependents

Maintenance 0 / 25

Adoption 10 / 25

Maturity 8 / 25

Community 10 / 25

Stars

261

Forks

Language

—

License

—

Higher-rated alternatives

MantisAI/sieves

Plug-and-play document AI with zero-shot models.

xiaoya-li/Instruction-Tuning-Survey

Project for the paper entitled `Instruction Tuning for Large Language Models: A Survey`

TencentARC-QQ/TagGPT

TagGPT: Large Language Models are Zero-shot Multimodal Taggers

LIN-SHANG/InstructERC

The official realization of InstructERC

Lichang-Chen/InstructZero

Official Implementation of InstructZero; the first framework to optimize bad prompts of...
