synthetic-data-kit and Synthetic-data-gen

These tools are competitors, with meta-llama/synthetic-data-kit likely offering a more comprehensive and robust solution for generating high-quality synthetic datasets, as indicated by its significantly higher star count, compared to tirthajyoti/Synthetic-data-gen which provides a broader collection of synthetic data generation methods that may be less focused on quality optimization.

synthetic-data-kit
63
Established
Synthetic-data-gen
46
Emerging
Maintenance 6/25
Adoption 10/25
Maturity 25/25
Community 22/25
Maintenance 0/25
Adoption 9/25
Maturity 16/25
Community 21/25
Stars: 1,524
Forks: 215
Downloads:
Commits (30d): 0
Language: Python
License: MIT
Stars: 83
Forks: 42
Downloads:
Commits (30d): 0
Language: Jupyter Notebook
License: MIT
No risk flags
Stale 6m No Package No Dependents

About synthetic-data-kit

meta-llama/synthetic-data-kit

Tool for generating high quality Synthetic datasets

Supports multi-format document ingestion (PDF, DOCX, HTML, YouTube transcripts) and generates structured fine-tuning datasets through a modular 4-stage pipeline: ingest → create (QA pairs, Chain-of-Thought reasoning, or summaries) → curate (using Llama-as-judge quality filtering) → save-as (converts to Alpaca, OpenAI, or HuggingFace formats). Uses Lance vector storage by default and integrates with vLLM or external LLM APIs for generation, with full YAML-based configuration overrides.

About Synthetic-data-gen

tirthajyoti/Synthetic-data-gen

Various methods for generating synthetic data for data science and ML

Scores updated daily from GitHub, PyPI, and npm data. How scores work