weAIDB/awesome-data-llm

Official Repository of "LLM × DATA" Survey Paper

/ 100

Established

Organizes research across three interconnected areas: data optimization for LLM training (via the IaaS framework, which addresses inclusiveness, abundance, articulation, and sanitization), LLM/Agent-as-Data-Analyst techniques spanning structured to heterogeneous data modalities, and LLM-enhanced data preparation workflows for cleaning, integration, and enrichment. Curates papers and methodologies covering the full LLM lifecycle, from pretraining and fine-tuning through RAG and agent systems, together with data infrastructure concerns such as deduplication, filtering, storage formats, and serving optimization. Also covers emerging prompt-driven data workflows and agentic preparation systems, as well as foundational data-centric approaches for scaling model performance.

740 stars. Actively maintained with 10 commits in the last 30 days.

No License · No Package · No Dependents

Maintenance 17 / 25

Adoption 10 / 25

Maturity 8 / 25

Community 17 / 25

Stars

740

Forks

Language

—

License

—

Related tools

monarch-initiative/ontogpt

LLM-based ontological extraction tools, including SPIRES

open-chinese/poetry-collection

Chinese "Poetry Collection": the most comprehensive and systematic Chinese poetry dataset to date, with unified data modeling.

AXYZdong/AMchat

AM (Advanced Mathematics) Chat is a large language model that integrates advanced mathematical...

skywalker023/sodaverse

🥤🧑🏻‍🚀Code and dataset for our EMNLP 2023 paper - "SODA: Million-scale Dialogue Distillation with...

Jeryi-Sun/LLM-and-Law

This repository is dedicated to summarizing papers on large language models in the field of law.
