Curator vs data-prep-kit — 71 vs 64 Quality Score

Curator: 71 (Verified)

data-prep-kit: 64 (Established)

Curator score breakdown:
Maintenance 22/25
Adoption 10/25
Maturity 16/25
Community 23/25

data-prep-kit score breakdown:
Maintenance 13/25
Adoption 10/25
Maturity 16/25
Community 25/25
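
The headline scores of 71 and 64 match the sum of the four 25-point category scores for each project. The page does not state the formula, so the plain-sum aggregation below is an assumption inferred from the numbers. A minimal Python sketch that checks it:

```python
# Category scores copied from the breakdown above; each category is out of 25.
SUBSCORES = {
    "Curator": {"Maintenance": 22, "Adoption": 10, "Maturity": 16, "Community": 23},
    "data-prep-kit": {"Maintenance": 13, "Adoption": 10, "Maturity": 16, "Community": 25},
}


def total_score(subscores):
    """Assumed aggregation: plain sum of the four category scores (max 100)."""
    return sum(subscores.values())


totals = {name: total_score(parts) for name, parts in SUBSCORES.items()}
print(totals)  # {'Curator': 71, 'data-prep-kit': 64}
```

Under that assumption, the 7-point gap comes from Maintenance (22 vs 13, 9 points in Curator's favor), partly offset by Community (23 vs 25, 2 points in data-prep-kit's favor). Adoption and Maturity are tied.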

Curator repository stats:
Stars: 1,443
Forks: 230
Downloads: n/a
Commits (30d): 71
Language: Python
License: Apache-2.0

data-prep-kit repository stats:
Stars: 906
Forks: 247
Downloads: n/a
Commits (30d): 2
Language: HTML
License: Apache-2.0

No Package, No Dependents

About Curator

NVIDIA-NeMo/Curator

Scalable data preprocessing and curation toolkit for LLMs

This tool helps AI engineers and researchers prepare massive datasets for training large language models and other generative AI models. It takes raw text, image, video, or audio data from various sources and outputs cleaned, filtered, and deduplicated datasets. Its primary users are MLOps engineers and AI researchers who build and improve large-scale AI models.

Tags: AI model training, large language models, generative AI, data preprocessing, machine learning operations
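
To make the "cleaned, filtered, and deduplicated" stages in the description above concrete, here is a toy sketch in standard-library Python. It does not use Curator's API. The function names, the word-count filter, and the exact-match hashing are all hypothetical choices made for illustration. Production toolkits typically add fuzzy deduplication and distributed execution on top of the same idea.

```python
import hashlib
import re


def clean(text):
    """Collapse runs of whitespace and strip leading/trailing space."""
    return re.sub(r"\s+", " ", text).strip()


def keep(text, min_words=3):
    """Hypothetical quality filter: drop documents with too few words."""
    return len(text.split()) >= min_words


def curate(docs, min_words=3):
    """Clean, filter, then drop exact duplicates (case-insensitive)."""
    seen = set()
    out = []
    for doc in docs:
        text = clean(doc)
        if not keep(text, min_words):
            continue
        # Hash the lowercased text so duplicates differing only in case collide.
        digest = hashlib.sha256(text.lower().encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        out.append(text)
    return out


raw = [
    "  Large language models   need clean data. ",
    "large language models need clean data.",
    "Too short",
    "Deduplication removes repeated documents.",
]
curated = curate(raw)
print(curated)
```

Of the four sample inputs, the second is dropped as a case-insensitive duplicate of the first and the third fails the word-count filter, so two documents remain.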

About data-prep-kit

data-prep-kit/data-prep-kit

Open source project for data preparation for GenAI applications

This kit helps AI application developers prepare unstructured data for use in large language models (LLMs). It takes raw text, code, or image data from various sources like PDFs, HTML, or zip files and cleanses, transforms, and enriches it. The output is high-quality, structured data ready for pre-training, fine-tuning, or building Retrieval Augmented Generation (RAG) applications.

Tags: AI development, LLM data preparation, natural language processing, RAG applications, unstructured data
