LLM Data Labeling Tools
Tools and platforms for annotating, labeling, and cleaning datasets using LLMs, including data quality management and weak supervision frameworks. Does NOT include general data processing pipelines, embeddings-only tools, or non-annotation data transformation.
36 LLM data labeling tools are tracked. One scores above 70 (Verified tier): NVIDIA-NeMo/Curator, the highest-rated at 74/100 with 1,443 stars. Three of the top 10 are actively maintained.
Get all 36 projects as JSON
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=llm-data-labeling&limit=20"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
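The same request can be made from a script. The following is a minimal Python sketch, not an official client: the helper names `build_url` and `fetch_projects` are hypothetical, the response is assumed to be a JSON body whose shape is not documented here, and `limit=20` is simply copied from the curl command above (whether the API accepts a larger value is not confirmed by this page).

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the curl command above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="llm-tools", subcategory="llm-data-labeling", limit=20):
    """Build the query URL with the same parameters as the curl command."""
    query = urllib.parse.urlencode(
        {"domain": domain, "subcategory": subcategory, "limit": limit}
    )
    return f"{BASE_URL}?{query}"


def fetch_projects(url, timeout=30):
    """Fetch the URL and decode the body as JSON.

    Assumption: the endpoint returns JSON; its exact structure is not
    described in this listing, so the result is returned unmodified.
    """
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example usage (performs a network request, counted against the daily quota):
#   data = fetch_projects(build_url())
#   print(json.dumps(data, indent=2)[:500])
```

Keeping URL construction separate from the network call makes it easy to check the query string before spending any of the daily request allowance.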
| # | Tool | Score | Tier |
|---|---|---|---|
| 1 | NVIDIA-NeMo/Curator<br>Scalable data pre processing and curation toolkit for LLMs | | Verified |
| 2 | MigoXLab/dingo<br>Dingo: A Comprehensive AI Data, Model and Application Quality Evaluation Tool | | Established |
| 3 | data-prep-kit/data-prep-kit<br>Open source project for data preparation for GenAI applications | | Established |
| 4 | cleanlab/cleanlab-studio<br>Client interface to Cleanlab Studio | | Established |
| 5 | TheDataStation/pneuma<br>LLM-Powered Data Discovery System for Tabular Data | | Emerging |
| 6 | GUNDAM-Labet/GUNDAM<br>GUNDAM is a data management system that prioritizes data using language models. | | Emerging |
| 7 | nxank4/loclean<br>⚡️ The All-in-One Local AI Data Cleaning Library | | Emerging |
| 8 | SCAI-BIO/datastew<br>Python library for intelligent data stewardship using Large Language Model... | | Emerging |
| 9 | jpmorganchase/CodeQuest<br>CodeQUEST is a generalizable framework which leverages LLMs to iteratively... | | Emerging |
| 10 | BatsResearch/alfred<br>A system for prompted weak supervision. Alfred is a powerful tool that... | | Emerging |
| 11 | AI4Bharat/Anudesh<br>An open source platform to annotate data for Large language models - at scale | | Emerging |
| 12 | codepawl/loclean<br>An AI Data Cleaning Library | | Emerging |
| 13 | saran9991/llm-data-annotation<br>Use Large Language Models like OpenAI's GPT-3.5 for data annotation and... | | Emerging |
| 14 | worldbank/llm4data<br>LLM4Data is a Python library designed to facilitate the application of large... | | Emerging |
| 15 | niclasgriesshaber/llm_patent_pipeline<br>LLMs for Historical Dataset Construction from Archival Image Scans | | Emerging |
| 16 | hikariming/pindata<br>PinData is a modern, open-source dataset management platform designed... | | Emerging |
| 17 | cgxjdzz/FeatureForge-LLM<br>FeatureForge LLM is a Python package that leverages large language models... | | Experimental |
| 18 | codeastra2/llm-feat<br>Automated feature engineering using Large Language Models (LLMs) for tabular data | | Experimental |
| 19 | J0nasW/science-datalake<br>Unified data lake of 293M scientific papers from 8 scholarly sources + 13... | | Experimental |
| 20 | benwhalley/soak<br>soak: rigorous and transparent qualitative analysis with LLMs | | Experimental |
| 21 | tayyab-nlp/AnnotaLoop<br>AI-assisted document annotation with human-in-the-loop workflows | | Experimental |
| 22 | PennShenLab/FREEFORM<br>FREEFORM \| Knowledge-Driven Feature Selection and Engineering with Large... | | Experimental |
| 23 | ywn7/llm-data-normalization-pattern<br>🔧 Normalize data intelligently with this serverless pattern leveraging LLMs,... | | Experimental |
| 24 | itamaker/datasetlint<br>Audit JSONL datasets for duplicates, empty fields, and train/eval leakage. | | Experimental |
| 25 | lechmazur/writing_styles<br>Documents the style side of the short-story Creative Writing LLM benchmark:... | | Experimental |
| 26 | data-prompt-query/dpq<br>dpq is an open-source python library that makes prompt-based data... | | Experimental |
| 27 | dab3oon/writing_styles<br>📚 Analyze stylistic differences in AI-generated flash fiction to understand... | | Experimental |
| 28 | gitEricsson/NuCore<br>This model converts diverse risk registers (Excel and PDF formats) into... | | Experimental |
| 29 | qubasehq/qudata<br>A comprehensive LLM data processing system designed to transform raw... | | Experimental |
| 30 | waikato-llm/llm-dataset-converter-all<br>Meta-library that combines all llm-dataset-converter libraries. | | Experimental |
| 31 | MehrdadJalali-AI/LLM-ELN<br>Integrating LLMs with ELNs to transform materials science research at KIT,... | | Experimental |
| 32 | sauravattri23/llm-data-catalog<br>LLM-Powered Data Catalog & Lineage Platform | | Experimental |
| 33 | CodeguruEdison/llm-tagger-api<br>AI-powered auto-tagging API for repair order notes — uses LLM + rules engine... | | Experimental |
| 34 | jd-coderepos/awases-ald<br>A repository outlining the use of LLMs to extract structured process... | | Experimental |
| 35 | gabyarte/event-extraction-small-corpus<br>Event extraction on legal domain's small corpus using Large Language Models.... | | Experimental |
| 36 | kpmainali/species-knowledge-base<br>Multi-LLM pipeline for validated extraction and structuring of species-level... | | Experimental |