synthetic-data-kit and Synthetic-data-gen

These tools are competitors, with meta-llama/synthetic-data-kit likely offering a more comprehensive and robust solution for generating high-quality synthetic datasets, as indicated by its significantly higher star count, compared to tirthajyoti/Synthetic-data-gen which provides a broader collection of synthetic data generation methods that may be less focused on quality optimization.

synthetic-data-kit

Established

Synthetic-data-gen

Emerging

Maintenance 6/25

Adoption 10/25

Maturity 25/25

Community 22/25

Maintenance 0/25

Adoption 9/25

Maturity 16/25

Community 21/25

Stars: 1,524

Forks: 215

Downloads: —

Commits (30d): 0

Language: Python

License: MIT

Stars: 83

Forks: 42

Downloads: —

Commits (30d): 0

Language: Jupyter Notebook

License: MIT

No risk flags

Stale 6m No Package No Dependents

About synthetic-data-kit

meta-llama/synthetic-data-kit

Tool for generating high quality Synthetic datasets

Supports multi-format document ingestion (PDF, DOCX, HTML, YouTube transcripts) and generates structured fine-tuning datasets through a modular 4-stage pipeline: ingest → create (QA pairs, Chain-of-Thought reasoning, or summaries) → curate (using Llama-as-judge quality filtering) → save-as (converts to Alpaca, OpenAI, or HuggingFace formats). Uses Lance vector storage by default and integrates with vLLM or external LLM APIs for generation, with full YAML-based configuration overrides.

About Synthetic-data-gen

tirthajyoti/Synthetic-data-gen

Various methods for generating synthetic data for data science and ML

Related comparisons

synthetic-data-kit and ydata-synthetic

Scores updated daily from GitHub, PyPI, and npm data. How scores work