LLM Data Labeling Tools

Tools and platforms for annotating, labeling, and cleaning datasets using LLMs, including data quality management and weak supervision frameworks. Does NOT include general data processing pipelines, embeddings-only tools, or non-annotation data transformation.

36 LLM data labeling tools are tracked. One scores above 70 (Verified tier). The highest-rated is NVIDIA-NeMo/Curator at 74/100 with 1,443 stars. Three of the top 10 are actively maintained.

Get all 36 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=llm-data-labeling&limit=20"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
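
The same request can be made from a script. The following is a minimal Python sketch that rebuilds the URL from the curl command and parses the response as JSON. The function names `build_url` and `fetch_projects` are illustrative, and the shape of the returned JSON is not documented here, so the sketch returns the parsed payload without assuming any field names. Note that the sample URL uses `limit=20`; whether a larger `limit` returns all 36 projects in one call is an assumption that should be checked against the API.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="llm-tools", subcategory="llm-data-labeling", limit=20):
    """Build the query URL; defaults mirror the curl example."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_projects(url, timeout=30):
    """Fetch the URL and return the parsed JSON payload as-is.

    The response schema is not specified in this listing, so no keys
    are accessed here.
    """
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Example usage (performs a network request, counts against the daily limit):
# payload = fetch_projects(build_url())
# print(type(payload).__name__)
```

With no key, each call counts toward the 100 requests/day allowance, so caching the payload locally is a sensible default.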

| # | Tool | Description | Score | Tier |
|---|------|-------------|-------|------|
| 1 | NVIDIA-NeMo/Curator | Scalable data pre processing and curation toolkit for LLMs | 74 | Verified |
| 2 | MigoXLab/dingo | Dingo: A Comprehensive AI Data, Model and Application Quality Evaluation Tool | 67 | Established |
| 3 | data-prep-kit/data-prep-kit | Open source project for data preparation for GenAI applications | 64 | Established |
| 4 | cleanlab/cleanlab-studio | Client interface to Cleanlab Studio | 56 | Established |
| 5 | TheDataStation/pneuma | LLM-Powered Data Discovery System for Tabular Data | 46 | Emerging |
| 6 | GUNDAM-Labet/GUNDAM | GUNDAM is a data management system that prioritizes data using language models. | 46 | Emerging |
| 7 | nxank4/loclean | ⚡️ The All-in-One Local AI Data Cleaning Library | 43 | Emerging |
| 8 | SCAI-BIO/datastew | Python library for intelligent data stewardship using Large Language Model... | 41 | Emerging |
| 9 | jpmorganchase/CodeQuest | CodeQUEST is a generalizable framework which leverages LLMs to iteratively... | 40 | Emerging |
| 10 | BatsResearch/alfred | A system for prompted weak supervision. Alfred is a powerful tool that... | 39 | Emerging |
| 11 | AI4Bharat/Anudesh | An open source platform to annotate data for Large language models - at scale | 34 | Emerging |
| 12 | codepawl/loclean | An AI Data Cleaning Library | 34 | Emerging |
| 13 | saran9991/llm-data-annotation | Use Large Language Models like OpenAI's GPT-3.5 for data annotation and... | 33 | Emerging |
| 14 | worldbank/llm4data | LLM4Data is a Python library designed to facilitate the application of large... | 33 | Emerging |
| 15 | niclasgriesshaber/llm_patent_pipeline | LLMs for Historical Dataset Construction from Archival Image Scans | 30 | Emerging |
| 16 | hikariming/pindata | PinData is a modern, open-source dataset management platform designed... | 30 | Emerging |
| 17 | cgxjdzz/FeatureForge-LLM | FeatureForge LLM is a Python package that leverages large language models... | 28 | Experimental |
| 18 | codeastra2/llm-feat | Automated feature engineering using Large Language Models (LLMs) for tabular data | 27 | Experimental |
| 19 | J0nasW/science-datalake | Unified data lake of 293M scientific papers from 8 scholarly sources + 13... | 26 | Experimental |
| 20 | benwhalley/soak | soak: rigorous and transparent qualitative analysis with LLMs | 23 | Experimental |
| 21 | tayyab-nlp/AnnotaLoop | AI-assisted document annotation with human-in-the-loop workflows | 23 | Experimental |
| 22 | PennShenLab/FREEFORM | FREEFORM \| Knowledge-Driven Feature Selection and Engineering with Large... | 23 | Experimental |
| 23 | ywn7/llm-data-normalization-pattern | 🔧 Normalize data intelligently with this serverless pattern leveraging LLMs,... | 22 | Experimental |
| 24 | itamaker/datasetlint | Audit JSONL datasets for duplicates, empty fields, and train/eval leakage. | 22 | Experimental |
| 25 | lechmazur/writing_styles | Documents the style side of the short-story Creative Writing LLM benchmark:... | 21 | Experimental |
| 26 | data-prompt-query/dpq | dpq is an open-source python library that makes prompt-based data... | 20 | Experimental |
| 27 | dab3oon/writing_styles | 📚 Analyze stylistic differences in AI-generated flash fiction to understand... | 16 | Experimental |
| 28 | gitEricsson/NuCore | This model converts diverse risk registers (Excel and PDF formats) into... | 16 | Experimental |
| 29 | qubasehq/qudata | A comprehensive LLM data processing system designed to transform raw... | 15 | Experimental |
| 30 | waikato-llm/llm-dataset-converter-all | Meta-library that combines all llm-dataset-converter libraries. | 15 | Experimental |
| 31 | MehrdadJalali-AI/LLM-ELN | Integrating LLMs with ELNs to transform materials science research at KIT,... | 14 | Experimental |
| 32 | sauravattri23/llm-data-catalog | LLM-Powered Data Catalog & Lineage Platform | 14 | Experimental |
| 33 | CodeguruEdison/llm-tagger-api | AI-powered auto-tagging API for repair order notes — uses LLM + rules engine... | 14 | Experimental |
| 34 | jd-coderepos/awases-ald | A repository outlining the use of LLMs to extract structured process... | 13 | Experimental |
| 35 | gabyarte/event-extraction-small-corpus | Event extraction on legal domain's small corpus using Large Language Models.... | 11 | Experimental |
| 36 | kpmainali/species-knowledge-base | Multi-LLM pipeline for validated extraction and structuring of species-level... | 11 | Experimental |
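
The scores and tiers listed above are consistent with fixed score cutoffs, which can be sketched in Python as follows. Only the "above 70" boundary for Verified is stated on this page. The cutoffs of 50 and 30 are assumptions inferred from the listed rows (Established spans 56 to 67, Emerging 30 to 46, Experimental 11 to 28) and may not match the site's actual rules. The function name `tier_for` is illustrative.

```python
def tier_for(score):
    """Map a 0-100 score to a tier name.

    Only the Verified boundary ("above 70") comes from the listing.
    The 50 and 30 cutoffs are ASSUMED from the observed rows and are
    not confirmed by the source.
    """
    if score > 70:
        return "Verified"
    if score >= 50:  # assumed cutoff
        return "Established"
    if score >= 30:  # assumed cutoff
        return "Emerging"
    return "Experimental"


# A few (score, tier) pairs copied from the listing, used as a spot check.
listed = [(74, "Verified"), (67, "Established"), (56, "Established"),
          (46, "Emerging"), (30, "Emerging"), (28, "Experimental"),
          (11, "Experimental")]
mismatches = [(s, t) for s, t in listed if tier_for(s) != t]
```

Any cutoff between 47 and 56 for Established, and 29 or 30 for Emerging, would reproduce the listed tiers equally well, so the exact values here are a convenience rather than a finding.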