NLP Dataset Collections NLP Tools
Curated lists, catalogs, and repositories of NLP datasets organized by language, task, or domain. Does NOT include individual datasets, dataset creation tools, or data annotation platforms.
There are 93 NLP dataset collection tools tracked. 3 score above 50 (Established tier). The highest-rated is acl-org/acl-anthology at 69/100 with 693 stars. 1 of the top 10 is actively maintained.
Get the projects as JSON (the example request below sets limit=20; 93 projects are tracked in total)
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=nlp&subcategory=nlp-dataset-collections&limit=20"
Open to everyone: 100 requests/day, no key needed. Get a free key for 1,000/day.
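The same request can be made from a script. The following is a minimal Python sketch that reuses only the URL and query parameters shown in the curl command above. The helper names `build_url` and `fetch` are hypothetical, the JSON response schema is not documented on this page, and it is an assumption that the endpoint accepts `limit=93`, so the sketch just prints the start of the raw payload.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the curl command above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(limit: int = 93) -> str:
    """Build the query URL from the parameters used in the curl command.

    Assumption: the API accepts a limit as large as 93; the page only
    shows limit=20.
    """
    params = {
        "domain": "nlp",
        "subcategory": "nlp-dataset-collections",
        "limit": limit,
    }
    return f"{BASE_URL}?{urllib.parse.urlencode(params)}"


def fetch(limit: int = 93, timeout: float = 30.0):
    """Fetch and decode the JSON payload (schema not documented here)."""
    with urllib.request.urlopen(build_url(limit), timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    payload = fetch()
    # Print only the beginning, since the structure is unknown.
    print(json.dumps(payload, indent=2)[:500])
```

Each call to `fetch` counts against the daily request quota described above, so caching the payload locally is a reasonable choice when experimenting.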
| # | Tool | Description | Score | Tier |
|---|---|---|---|---|
| 1 | acl-org/acl-anthology | Data and software for building the ACL Anthology. | | Established |
| 2 | anoopkunchukuttan/indic_nlp_library | Resources and tools for Indian language Natural Language Processing | | Established |
| 3 | SudhirGadhvi/open-vernacular-ai-kit | Clean Indian code-mixed text before it reaches your LLM. | | Established |
| 4 | CLUEbenchmark/CLUECorpus2020 | Large-scale Pre-training Corpus for Chinese, 100G | | Emerging |
| 5 | KennethEnevoldsen/scandinavian-embedding-benchmark | A Scandinavian Benchmark for sentence embeddings | | Emerging |
| 6 | Separius/awesome-sentence-embedding | A curated list of pretrained sentence and word embedding models | | Emerging |
| 7 | AndyTheFactory/romanian-nlp-datasets | A list of Romanian NLP Datasets | | Emerging |
| 8 | banglakit/awesome-bangla | A collection of tools, datasets and resources on Bangla computing | | Emerging |
| 9 | masakhane-io/masakhane-community | All our community docs! Start here! Let's put Africa on the NLP Map | | Emerging |
| 10 | mirfan899/Urdu | Collection of Urdu datasets for POS, NER, Sentiment, Summarization and NLP tasks. | | Emerging |
| 11 | AI4Bharat/Indic-BERT-v1 | Indic-BERT-v1: BERT-based Multilingual Model for 11 Indic Languages and... | | Emerging |
| 12 | knadh/indic.page | A directory of Indic (Indian) language computing resources. | | Emerging |
| 13 | Vikhram-S/IndianConstitution | A Python library for exploring the Constitution of India. | | Emerging |
| 14 | computerclubkec/constitution-of-nepal-dataset | A structured and organized dataset of the Constitution of Nepal in... | | Emerging |
| 15 | yisaienkov/tinysets | The project aims to collect various datasets for tasks such as... | | Emerging |
| 16 | Smat26/Roman-Urdu-Dataset | Compilation of Manually Tagged Roman Urdu Dataset (Urdu written in... | | Emerging |
| 17 | dsfsi/masakhane-web | Masakhane Web is a translation web application solely for African languages. | | Emerging |
| 18 | shjwudp/c4-dataset-script | Inspired by Google C4, here is a series of colossal clean data cleaning... | | Emerging |
| 19 | praatibhsurana/Hinglish_Hindi_WSD | A pipeline for transliteration, spell correction, POS tagging and word sense... | | Emerging |
| 20 | amir9ume/urdu_ghazals_rekhta | Dataset for Urdu Ghazals | | Emerging |
| 21 | jcblaisecruz02/Filipino-Text-Benchmarks | Open-source benchmark datasets and pretrained transformer models in the... | | Emerging |
| 22 | CLUEbenchmark/CLUEPretrainedModels | A collection of high-quality Chinese pretrained models: state-of-the-art large models, fastest small models, and specialized similarity models | | Emerging |
| 23 | Vikhram-S/IndianConstitution-js | A robust JavaScript library designed to provide seamless access to the... | | Emerging |
| 24 | Andrews2017/africanlp-public-datasets | A repository for publicly/freely available Natural Language Processing (NLP)... | | Emerging |
| 25 | uma-pi1/OPIEC | Reading the data from OPIEC - an Open Information Extraction corpus | | Emerging |
| 26 | federicarollo/Italian-Crime-News | A dataset from the Gazzetta di Modena newspaper about crime events in the... | | Experimental |
| 27 | cambridgeltl/cometa | Corpus of Online Medical EnTities: the cometA corpus | | Experimental |
| 28 | csebuetnlp/banglabert | This repository contains the official release of the model "BanglaBERT" and... | | Experimental |
| 29 | banglanlp/bnlp-resources | Awesome datasets for Bangla language computing. | | Experimental |
| 30 | zhanlaoban/NLP_PEMDC | NLP Pretrained Embeddings, Models and Datasets Collections (NLP_PEMDC). The... | | Experimental |
| 31 | UsmanNiazi/DUC-2004-Dataset | This Repo Contains the DUC 2004 Dataset | | Experimental |
| 32 | jacklanda/CCAE | [NLPCC 2023] CCAE: A Corpus of Chinese-based Asian Englishes | | Experimental |
| 33 | MuhammadYaseenKhan/Urdu-Sentiment-Corpus | Labelled Dataset for Urdu Sentiment Analysis | | Experimental |
| 34 | Sueza-project/Sueza_project | Linguistic database collection for the revitalization of Cameroonian local... | | Experimental |
| 35 | anoopkunchukuttan/meteor_indic | METEOR for Indian languages (originally forked from METEOR 1.4) | | Experimental |
| 36 | crux82/huric | HuRIC 2.0 - the Human Robot Interaction Corpus | | Experimental |
| 37 | s-bose/Walks-into-a-bar-dataset | A dataset containing 1000+ walks-into-a-bar jokes scraped from the internet. | | Experimental |
| 38 | mussacharles60/swahili-dictionary | Swahili dictionary for implementing in your projects | | Experimental |
| 39 | EthioNLP/Resource | This repository contains research papers and datasets for different NLP... | | Experimental |
| 40 | Riccorl/nlp-dataset-readers | Readers for NLP Datasets | | Experimental |
| 41 | lanwuwei/Twitter-URL-Corpus | Large scale sentential paraphrases collection and annotation | | Experimental |
| 42 | mrpeerat/Thai-Sentence-Vector-Benchmark | Benchmark for Thai sentence representation | | Experimental |
| 43 | kili-technology/awesome-datasets | A comprehensive list of annotated training datasets classified by use case. | | Experimental |
| 44 | UKPLab/useb | Heterogeneous, Task- and Domain-Specific Benchmark for Unsupervised Sentence... | | Experimental |
| 45 | hrgupta/indian-scriptures | This repository contains various Indian scriptures 📜 in a structured .csv... | | Experimental |
| 46 | COS301-SE-2025/Mafoko | Mafoko is a progressive web app (PWA) that provides access to multilingual... | | Experimental |
| 47 | reem-codes/ArMATH | ArMATH: The Arabic Math Word Problem dataset. Accepted in LREC2022 | | Experimental |
| 48 | maxent-ai/Datasets | datasets with text data for use in NLP, Text analysis, information... | | Experimental |
| 49 | mapmeld/hindi-bert | Hindi NLP work | | Experimental |
| 50 | t-systems-on-site-services-gmbh/german-elmo-model | This is a German ELMo deep contextualized word representation. It is trained... | | Experimental |
| 51 | massanishi/hackernews-post-datasets | Datasets for hackernews posts | | Experimental |
| 52 | kassemsabeh/open-brand | The dataset contains over 250k product brand-value annotations with more... | | Experimental |
| 53 | Hironsan/wiki-article-dataset | Wikipedia article dataset | | Experimental |
| 54 | aalok-sathe/sentspace | a module to obtain diverse real-world-grounded features for sentences for... | | Experimental |
| 55 | Pogayo/Luo-News-Dataset | This repo contains LUO corpus for Named Entity Recognition. The text comes... | | Experimental |
| 56 | hyunwoongko/nlp-datasets | Curation note of NLP datasets | | Experimental |
| 57 | SuzanaK/language_datasets | Language Datasets for NLP, Machine Learning, and Map Creation | | Experimental |
| 58 | NetworkTheoryAppliedResearchInstitute/anthropology- | Comprehensive AI training corpus for anthropology education: 580K tokens... | | Experimental |
| 59 | VLa-Labs/Danish-Language-Dataset-List | A curated metadata collection of 31 publicly available Danish language datasets. | | Experimental |
| 60 | quality-attributes/datasets | Official data sources for the Quality Attributes project | | Experimental |
| 61 | OumaimaHourrane/MA_Open_Datasets | Moroccan NLP Datasets and Corpora | | Experimental |
| 62 | nlp-waseda/comet-atomic-ja | COMET-ATOMIC ja | | Experimental |
| 63 | filbench/filbench-eval | Experiments and Analyses for FilBench: An Open LLM Leaderboard for Filipino... | | Experimental |
| 64 | megagonlabs/ebe-dataset | Evidence-based Explanation Dataset (AACL-IJCNLP 2020) | | Experimental |
| 65 | mohansaidinesh/Datasets | Datasets for Machine Learning | | Experimental |
| 66 | ART-Group-it/GASP | GASP! Dataset - Generating Abstracts of Scientific Papers from Abstracts of... | | Experimental |
| 67 | pln-fing-udelar/humor | HUMOR dataset for humor research | | Experimental |
| 68 | davidwarrior22/machine-translation-for-african-languages | This repository focuses on developing machine translation and NLP tools... | | Experimental |
| 69 | aviaefrat/cryptonite | The Official Repository of the Cryptonite Dataset | | Experimental |
| 70 | viperx-20/awesome-sentence-embedding | A curated list of pretrained sentence and word embedding models | | Experimental |
| 71 | kaisugi/datasets-for-sequential-sentence-classification | Curated list of public datasets which focus on sentence classification in... | | Experimental |
| 72 | Niger-Volta-LTI/urhobo-text | Urhobo language training text for NLP, ASR and TTS tasks | | Experimental |
| 73 | jahidulzaid/BanglaNostalgia | A benchmark and training pipeline for detecting nostalgia in Bangla text.... | | Experimental |
| 74 | dsfsi/project-state-capture | Zondo Commission or State Capture Commission Transcripts | | Experimental |
| 75 | mzmmoazam/kashmiri_dataset | Data and tool to fetch Kashmiri text | | Experimental |
| 76 | createmomo/supporting-comedy-writers | Predicting Audience’s Response from Sketch Comedy and Crosstalk Scripts (A... | | Experimental |
| 77 | bluechoochoo/retired_comedy_phrases | A Casual Spreadsheets resource | | Experimental |
| 78 | Archaeocomputers/Bessarion | A text and imaging dataset of Byzantine-era Medieval Greek inscriptions. | | Experimental |
| 79 | jonas-becker/pd-human-vs-machine-content | The official repository for the paper "Paraphrase Detection: Human vs.... | | Experimental |
| 80 | slvnwhrl/sigmorphon2022-models | This repository contains the models used by the CLUZH team for the... | | Experimental |
| 81 | CyberAgentAILab/AdParaphrase | This repository contains data for our paper "AdParaphrase: Paraphrase... | | Experimental |
| 82 | ICPSR/dataset-references | NER pipeline to detect dataset references for ASIST 2022 paper | | Experimental |
| 83 | KushtrimVisoka/Kosovo-Parliament-Transcriptions | NOTE: The dataset is maintained exclusively on HuggingFace Datasets. The... | | Experimental |
| 84 | OpenCENIA/SRN | Spanish Resources and Evaluation | | Experimental |
| 85 | dsfsi/PuoData | Curated corpora for Setswana. Used to train PuoBERTa. | | Experimental |
| 86 | radi-cho/noisy-sentences-dataset | 550K sentences in 5 European languages augmented with noise for training and... | | Experimental |
| 87 | NoelShallum/all-indian-acts | Repository containing all Indian Acts and statutes in the PDF and txt... | | Experimental |
| 88 | rmdodhia/dataset-detection | Detects datasets used in journal papers | | Experimental |
| 89 | dsfsi/zabantu-beta | ZaBantu is a fleet of light-weight Masked Language Models for Southern Bantu... | | Experimental |
| 90 | MusfiqDehan/bn-en-aligner | Tool to easily align Bangla and English words from sentences | | Experimental |
| 91 | BrianMsane/siSwati-Datasets | Repository for siSwati NLP datasets which I have worked on in my research.... | | Experimental |
| 92 | felixgiov/public-meeting | Dataset from the paper "Information Extraction from Public Meeting Articles" | | Experimental |
| 93 | metriccoders/metriccoders_datasets | This is the Metric Coders repository containing all the datasets for machine... | | Experimental |