Curator and data-prep-kit
About Curator
NVIDIA-NeMo/Curator
Scalable data preprocessing and curation toolkit for LLMs
This tool helps AI engineers and researchers prepare massive datasets for training large language models and other generative AI models. It takes raw text, images, video, or audio data from various sources and outputs cleaned, filtered, and deduplicated datasets. The primary users are MLOps engineers and AI researchers focused on building and improving large-scale AI models.
About data-prep-kit
data-prep-kit/data-prep-kit
Open source project for data preparation for GenAI applications
This kit helps AI application developers prepare unstructured data for use in large language models (LLMs). It takes raw text, code, or image data from various sources like PDFs, HTML, or zip files and cleanses, transforms, and enriches it. The output is high-quality, structured data ready for pre-training, fine-tuning, or building Retrieval Augmented Generation (RAG) applications.
Scores updated daily from GitHub, PyPI, and npm data.