NVIDIA-NeMo/Curator
Scalable data preprocessing and curation toolkit for LLMs
Offers GPU-accelerated pipelines for text, image, video, and audio curation with modality-specific filters (deduplication, quality assessment, NSFW detection, ASR transcription). Built on RAPIDS and Ray for distributed multi-node scaling, achieving 16× speedup on large-scale deduplication tasks. Integrates with NeMo Framework models and supports Common Crawl, WebDataset, and S3-compatible storage sources.
1,443 stars. Actively maintained with 55 commits in the last 30 days.
Stars: 1,443
Forks: 230
Language: Python
License: Apache-2.0
Category:
Last pushed: Mar 12, 2026
Commits (30d): 55
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/NVIDIA-NeMo/Curator"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
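The same request can be made from Python using only the standard library. This is a minimal sketch: the helper names `build_url` and `fetch_quality` are hypothetical, and it assumes the endpoint returns a JSON body, which the listing does not state. No response fields are shown because the schema is not documented here.

```python
import json
import urllib.parse
import urllib.request

# Base path copied from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality/llm-tools"


def build_url(repo: str) -> str:
    """Return the endpoint URL for an 'owner/name' repository slug."""
    owner, name = repo.split("/", 1)
    # Percent-encode each path segment so unusual characters stay valid.
    return f"{BASE_URL}/{urllib.parse.quote(owner)}/{urllib.parse.quote(name)}"


def fetch_quality(repo: str, timeout: float = 10.0):
    """Fetch the record for a repository (assumes a JSON response body)."""
    with urllib.request.urlopen(build_url(repo), timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Unauthenticated call; counts against the keyless daily limit.
    data = fetch_quality("NVIDIA-NeMo/Curator")
    print(json.dumps(data, indent=2))
```

Keeping URL construction separate from the network call makes it easy to reuse the helper for the other repositories listed on this page.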
Related tools
MigoXLab/dingo
Dingo: A Comprehensive AI Data, Model and Application Quality Evaluation Tool
data-prep-kit/data-prep-kit
Open source project for data preparation for GenAI applications
cleanlab/cleanlab-studio
Client interface to Cleanlab Studio
GUNDAM-Labet/GUNDAM
GUNDAM is a data management system that prioritizes data using language models.
TheDataStation/pneuma
LLM-Powered Data Discovery System for Tabular Data