Synthetic Data Generation LLM Tools

Tools for generating synthetic datasets and training data for LLMs through various methods (QA pairs, tabular data, code, structured extraction). Does NOT include general data processing, data augmentation for images, or dataset annotation/curation platforms.

42 synthetic data generation tools are tracked. 3 score 50 or above (Established tier). The highest-rated is InternScience/GraphGen at 56/100 with 978 stars. Only 1 of the top 10 is actively maintained.

Get all 42 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=synthetic-data-generation&limit=20"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
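For scripted access, the same request can be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_url` and `fetch_projects` are hypothetical, and the response is returned as decoded JSON without assuming any particular field names, since the response schema is not described here. Whether the endpoint accepts a `limit` larger than the 20 used in the curl example is also an assumption to verify.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint and query parameters are copied from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(limit=20):
    """Build the request URL (hypothetical helper, not part of the API)."""
    params = {
        "domain": "llm-tools",
        "subcategory": "synthetic-data-generation",
        "limit": limit,
    }
    return f"{BASE_URL}?{urlencode(params)}"


def fetch_projects(limit=20, timeout=30):
    """Fetch the listing and return the decoded JSON as-is.

    No key is sent, so the keyless daily request limit applies.
    """
    with urlopen(build_url(limit), timeout=timeout) as resp:
        return json.load(resp)
```

Calling `fetch_projects()` performs one HTTP GET against the endpoint; inspect the returned object before relying on any specific keys.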

| # | Tool | Description | Score | Tier |
|---|------|-------------|-------|------|
| 1 | InternScience/GraphGen | GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven... | 56 | Established |
| 2 | rasinmuhammed/misata | High-performance open-source synthetic data engine. Uses LLMs for schema... | 53 | Established |
| 3 | timothepearce/synda | A CLI for generating synthetic data | 50 | Established |
| 4 | dmanuel64/codablellm | A framework for creating and curating high-quality code datasets tailored... | 39 | Emerging |
| 5 | ZhuLinsen/FastDatasets | A powerful tool for creating high-quality training datasets for Large... | 39 | Emerging |
| 6 | ziegler-ingo/CRAFT | [TACL, EMNLP 2025 Oral] Code, datasets, and checkpoints for the paper "CRAFT... | 39 | Emerging |
| 7 | BatsResearch/bonito | A lightweight library for generating synthetic instruction tuning datasets... | 36 | Emerging |
| 8 | Oqura-ai/deepresearch-datagen-cli | Using deep research workflow to generate datasets for finetuning LLMs. | 36 | Emerging |
| 9 | Alannikos/edg4llm | A unified tool to generate fine-tuning datasets for LLMs, including... | 34 | Emerging |
| 10 | asaparov/prontoqa | Synthetic question-answering dataset to formally analyze the... | 34 | Emerging |
| 11 | nalinrajendran/synthetic-LLM-QA-dataset-generator | Create synthetic datasets for training and testing Language Learning Models... | 30 | Emerging |
| 12 | Itachi-Uchiha581/Auto-Data | Auto Data is a library designed for quick and effortless creation of... | 29 | Experimental |
| 13 | ISE-FIZKarlsruhe/concept_extraction | ConExion | 28 | Experimental |
| 14 | GURPREETKAURJETHRA/Synthetic-Data-Generation-using-LLM | Synthetic Data Generation using LLM via Argilla, Distilabel, ChatGPT, etc. | 27 | Experimental |
| 15 | kevinscaria/TarGEN | Targeted Data Generation with Large Language Models | 27 | Experimental |
| 16 | Glavin001/Data2AITextbook | 🚀 Automatically convert unstructured data into a high-quality 'textbook'... | 26 | Experimental |
| 17 | BothBosu/Synthetic-Data-for-Scam-Detection-Leveraging-LLMs-to-Train-Deep-Learning-Models | This repository contains the source code and synthetic datasets used in the... | 25 | Experimental |
| 18 | copyleftdev/faux-foundry | FauxFoundry - Synthetic data generation powered by local LLMs | 25 | Experimental |
| 19 | danmurf/datakeg | Brew synthetic training data from your documentation using LLMs | 22 | Experimental |
| 20 | Red1998/faux-foundry | 🤖 Generate unique synthetic datasets effortlessly with FauxFoundry, using... | 22 | Experimental |
| 21 | dmeldrum6/synthetic-dataset | Web based tool for generating Q&A datasets from an LLM | 22 | Experimental |
| 22 | jehumtine/synthetic_data_generator | This script is designed to convert bodies of text into a question and answer... | 20 | Experimental |
| 23 | rodrigobnogueira/faker-ai-provider | 🤖 Faker provider for generating AI/ML fake data - models, companies,... | 20 | Experimental |
| 24 | MelNajkar/llm-data-augmentation-sentiment | LLM-based synthetic data generation for improving sentiment classification... | 19 | Experimental |
| 25 | jqwangai/SynPT | An Improved Data Synthesis Method Driven by Large Language Models for... | 18 | Experimental |
| 26 | Pro-GenAI/DataClassifier | An AI-driven approach to Label LLM Training Data | 18 | Experimental |
| 27 | yzhan238/TELEClass | The source code used for paper "TELEClass: Taxonomy Enrichment and... | 17 | Experimental |
| 28 | nphdang/Pred-LLM | Generating tabular data via Large Language Models (LLMs) | 15 | Experimental |
| 29 | aekpalakorn/TGRE-Classification | Source code for the taxonomic knowledge assessment and occupation/skill... | 15 | Experimental |
| 30 | ScottishCoder/AuldLangSynth | AuldLangSynth is an open-source data-centric language synthesis platform... | 15 | Experimental |
| 31 | MichiganNLP/depression_synthetic_data | Can LMs generate useful synthetic data for the mental health domain? | 15 | Experimental |
| 32 | Chessperson/multiomics-synth | R synthpop for proteomics/metabolomics cohorts (your 30 cohorts, 6k+ cols).... | 14 | Experimental |
| 33 | ChandanKSahu/ReqList_ReqNet_ReqSim | ReqList, ReqNet and ReqSim Datasets for the publication `A Network and... | 14 | Experimental |
| 34 | jd-coderepos/scisynthesis | for prompts, dataset, and code addressing the task of scientific synthesis | 14 | Experimental |
| 35 | ZEKE320/llm-dataset-generator | The LLM Dataset Generator is an open source tool for generating text data... | 13 | Experimental |
| 36 | pezzos/jsonl_dataset_generator | Generate rich JSONL datasets from topics to fine-tune Large Language Models.... | 13 | Experimental |
| 37 | CartographerLabs/Lights-Camera-Extremism | A Social Network Synthetic Dataset Generation Framework | 12 | Experimental |
| 38 | tiddly-gittly/TiddlyWiki-LLM-dataset | WikiText syntax dataset generation pipeline and open dataset for auto UI... | 12 | Experimental |
| 39 | daspartho/DistillClassifier | Easily generate synthetic data for classification tasks using LLMs | 11 | Experimental |
| 40 | AikyamLab/regtext | A framework to generate unlearnable text data | 11 | Experimental |
| 41 | Ki-Seki/autotab | Automatically fill in missing values in tabular data using in-context... | 10 | Experimental |
| 42 | thalesbertaglia/instasynth | Synthetic Instagram Post Generation for Social Media Research | 10 | Experimental |
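
The listing does not state the tier cut-offs, but the rows above are consistent with Established starting at 50 and Emerging starting at 30 (a score of 30 is listed as Emerging and 29 as Experimental). The sketch below encodes that inferred rule; the thresholds and the function name `tier_for_score` are assumptions drawn from the listed rows, not a documented scoring policy.

```python
# Thresholds inferred from the listed rows; they are assumptions,
# not cut-offs documented by the site.
ESTABLISHED_MIN = 50
EMERGING_MIN = 30


def tier_for_score(score: int) -> str:
    """Map a 0-100 quality score to the tier label used in the listing."""
    if score >= ESTABLISHED_MIN:
        return "Established"
    if score >= EMERGING_MIN:
        return "Emerging"
    return "Experimental"


# A few (tool, score, tier) rows copied from the listing, including
# the boundary cases at 50, 30, and 29.
SAMPLE_ROWS = [
    ("InternScience/GraphGen", 56, "Established"),
    ("timothepearce/synda", 50, "Established"),
    ("dmanuel64/codablellm", 39, "Emerging"),
    ("nalinrajendran/synthetic-LLM-QA-dataset-generator", 30, "Emerging"),
    ("Itachi-Uchiha581/Auto-Data", 29, "Experimental"),
    ("thalesbertaglia/instasynth", 10, "Experimental"),
]

# Collect any sample row whose listed tier disagrees with the inferred rule.
mismatches = [
    (tool, score, tier)
    for tool, score, tier in SAMPLE_ROWS
    if tier_for_score(score) != tier
]
```

For the sample rows shown, `mismatches` stays empty, which is what makes the inferred thresholds plausible.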