Synthetic Data Generation LLM Tools
Tools for generating synthetic datasets and training data for LLMs through various methods (QA pairs, tabular data, code, structured extraction). Does NOT include general data processing, data augmentation for images, or dataset annotation/curation platforms.
There are 42 synthetic data generation tools tracked. 3 score above 50 (established tier). The highest-rated is InternScience/GraphGen at 56/100 with 978 stars. 1 of the top 10 is actively maintained.
Get all 42 projects as JSON
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=synthetic-data-generation&limit=20"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
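The same request can be made from a script. The following is a minimal sketch, not an official client: the helper names `build_url` and `fetch_projects` are hypothetical, the response is assumed to be JSON and is returned as-is because its schema is not documented here, and whether the endpoint accepts a `limit` larger than the 20 shown in the curl command is an assumption.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl command above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(limit=20):
    """Build the query URL; the parameter names mirror the curl example."""
    params = {
        "domain": "llm-tools",
        "subcategory": "synthetic-data-generation",
        "limit": limit,  # assumption: values above 20 may or may not be honored
    }
    return f"{BASE_URL}?{urlencode(params)}"


def fetch_projects(limit=20, timeout=30):
    """Fetch the listing and return the parsed JSON body unchanged.

    Hypothetical helper: the response structure is not documented in this
    page, so no fields are accessed here.
    """
    with urlopen(build_url(limit), timeout=timeout) as resp:
        return json.load(resp)
```

Calling `fetch_projects()` performs one network request, which counts against the 100 requests/day keyless quota, so it is worth caching the result locally rather than calling it in a loop.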
| # | Tool | Description | Score | Tier |
|---|---|---|---|---|
| 1 | InternScience/GraphGen | GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven... | 56 | Established |
| 2 | rasinmuhammed/misata | High-performance open-source synthetic data engine. Uses LLMs for schema... | | Established |
| 3 | timothepearce/synda | A CLI for generating synthetic data | | Established |
| 4 | dmanuel64/codablellm | A framework for creating and curating high-quality code datasets tailored... | | Emerging |
| 5 | ZhuLinsen/FastDatasets | A powerful tool for creating high-quality training datasets for Large... | | Emerging |
| 6 | ziegler-ingo/CRAFT | [TACL, EMNLP 2025 Oral] Code, datasets, and checkpoints for the paper "CRAFT... | | Emerging |
| 7 | BatsResearch/bonito | A lightweight library for generating synthetic instruction tuning datasets... | | Emerging |
| 8 | Oqura-ai/deepresearch-datagen-cli | Using deep research workflow to generate datasets for finetuning LLMs. | | Emerging |
| 9 | Alannikos/edg4llm | A unified tool to generate fine-tuning datasets for LLMs, including... | | Emerging |
| 10 | asaparov/prontoqa | Synthetic question-answering dataset to formally analyze the... | | Emerging |
| 11 | nalinrajendran/synthetic-LLM-QA-dataset-generator | Create synthetic datasets for training and testing Language Learning Models... | | Emerging |
| 12 | Itachi-Uchiha581/Auto-Data | Auto Data is a library designed for quick and effortless creation of... | | Experimental |
| 13 | ISE-FIZKarlsruhe/concept_extraction | ConExion | | Experimental |
| 14 | GURPREETKAURJETHRA/Synthetic-Data-Generation-using-LLM | Synthetic Data Generation using LLM via Argilla, Distilabel, ChatGPT, etc. | | Experimental |
| 15 | kevinscaria/TarGEN | Targeted Data Generation with Large Language Models | | Experimental |
| 16 | Glavin001/Data2AITextbook | 🚀 Automatically convert unstructured data into a high-quality 'textbook'... | | Experimental |
| 17 | BothBosu/Synthetic-Data-for-Scam-Detection-Leveraging-LLMs-to-Train-Deep-Learning-Models | This repository contains the source code and synthetic datasets used in the... | | Experimental |
| 18 | copyleftdev/faux-foundry | FauxFoundry - Synthetic data generation powered by local LLMs | | Experimental |
| 19 | danmurf/datakeg | Brew synthetic training data from your documentation using LLMs | | Experimental |
| 20 | Red1998/faux-foundry | 🤖 Generate unique synthetic datasets effortlessly with FauxFoundry, using... | | Experimental |
| 21 | dmeldrum6/synthetic-dataset | Web based tool for generating Q&A datasets from an LLM | | Experimental |
| 22 | jehumtine/synthetic_data_generator | This script is designed to convert bodies of text into a question and answer... | | Experimental |
| 23 | rodrigobnogueira/faker-ai-provider | 🤖 Faker provider for generating AI/ML fake data - models, companies,... | | Experimental |
| 24 | MelNajkar/llm-data-augmentation-sentiment | LLM-based synthetic data generation for improving sentiment classification... | | Experimental |
| 25 | jqwangai/SynPT | An Improved Data Synthesis Method Driven by Large Language Models for... | | Experimental |
| 26 | Pro-GenAI/DataClassifier | An AI-driven approach to Label LLM Training Data | | Experimental |
| 27 | yzhan238/TELEClass | The source code used for paper "TELEClass: Taxonomy Enrichment and... | | Experimental |
| 28 | nphdang/Pred-LLM | Generating tabular data via Large Language Models (LLMs) | | Experimental |
| 29 | aekpalakorn/TGRE-Classification | Source code for the taxonomic knowledge assessment and occupation/skill... | | Experimental |
| 30 | ScottishCoder/AuldLangSynth | AuldLangSynth is an open-source data-centric language synthesis platform... | | Experimental |
| 31 | MichiganNLP/depression_synthetic_data | Can LMs generate useful synthetic data for the mental health domain? | | Experimental |
| 32 | Chessperson/multiomics-synth | R synthpop for proteomics/metabolomics cohorts (your 30 cohorts, 6k+ cols).... | | Experimental |
| 33 | ChandanKSahu/ReqList_ReqNet_ReqSim | ReqList, ReqNet and ReqSim Datasets for the publication `A Network and... | | Experimental |
| 34 | jd-coderepos/scisynthesis | for prompts, dataset, and code addressing the task of scientific synthesis | | Experimental |
| 35 | ZEKE320/llm-dataset-generator | The LLM Dataset Generator is an open source tool for generating text data... | | Experimental |
| 36 | pezzos/jsonl_dataset_generator | Generate rich JSONL datasets from topics to fine-tune Large Language Models.... | | Experimental |
| 37 | CartographerLabs/Lights-Camera-Extremism | A Social Network Synthetic Dataset Generation Framework | | Experimental |
| 38 | tiddly-gittly/TiddlyWiki-LLM-dataset | WikiText syntax dataset generation pipeline and open dataset for auto UI... | | Experimental |
| 39 | daspartho/DistillClassifier | Easily generate synthetic data for classification tasks using LLMs | | Experimental |
| 40 | AikyamLab/regtext | A framework to generate unlearnable text data | | Experimental |
| 41 | Ki-Seki/autotab | Automatically fill in missing values in tabular data using in-context... | | Experimental |
| 42 | thalesbertaglia/instasynth | Synthetic Instagram Post Generation for Social Media Research | | Experimental |