weAIDB/awesome-data-llm
Official Repository of "LLM × DATA" Survey Paper
Organizes research across three interconnected domains: data optimization for LLM training (via the IaaS framework, which addresses inclusiveness, abundance, articulation, and sanitization); LLM/Agent-as-Data-Analyst techniques spanning structured to heterogeneous data modalities; and LLM-enhanced data preparation workflows for cleaning, integration, and enrichment. Curates papers and methodologies covering the full LLM lifecycle, from pretraining and fine-tuning through RAG and agent systems, together with data infrastructure concerns such as deduplication, filtering, storage formats, and serving optimization. Synthesizes emerging paradigms around prompt-driven data workflows and agentic preparation systems, in addition to foundational data-centric approaches for scaling model performance.
740 stars. Actively maintained with 10 commits in the last 30 days.
Stars: 740
Forks: 66
Language: not listed
License: not listed
Category: not listed
Last pushed: Mar 05, 2026
Commits (30d): 10
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/weAIDB/awesome-data-llm"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
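The same request can be sketched in Python using only the standard library. This is a minimal sketch, not an official client: the helper names `build_quality_url` and `fetch_quality` are hypothetical, and the assumption that the endpoint returns a JSON body is not confirmed by this page. How an API key is supplied is not described here, so the sketch covers only the keyless tier.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path taken from the curl command above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality/llm-tools"


def build_quality_url(owner: str, repo: str) -> str:
    """Build the endpoint URL for one repository (hypothetical helper).

    Each path segment is percent-encoded so unusual characters in an
    owner or repository name cannot alter the URL structure.
    """
    return f"{BASE_URL}/{quote(owner, safe='')}/{quote(repo, safe='')}"


def fetch_quality(owner: str, repo: str, timeout: float = 10.0):
    """Fetch the record for one repository (hypothetical helper).

    Assumption: the response body is JSON. The page does not document
    the response schema, so the parsed object is returned unchanged.
    """
    request = urllib.request.Request(
        build_quality_url(owner, repo),
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_quality("weAIDB", "awesome-data-llm")` would issue the same GET request as the curl command. Each call counts against the daily request limit, so repeated lookups are worth caching locally.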
Related tools
monarch-initiative/ontogpt
LLM-based ontological extraction tools, including SPIRES
open-chinese/poetry-collection
Chinese "Poetry Collection": the most comprehensive and systematic Chinese poetry dataset to date, with unified data modeling.
AXYZdong/AMchat
AM (Advanced Mathematics) Chat is a large language model that integrates advanced mathematical...
skywalker023/sodaverse
🥤🧑🏻🚀Code and dataset for our EMNLP 2023 paper - "SODA: Million-scale Dialogue Distillation with...
Jeryi-Sun/LLM-and-Law
This repository is dedicated to summarizing papers on large language models in the field of law