NLP Dataset Collections NLP Tools

Curated lists, catalogs, and repositories of NLP datasets organized by language, task, or domain. Does NOT include individual datasets, dataset creation tools, or data annotation platforms.

There are 93 NLP dataset collection tools tracked. 3 score above 50 (Established tier). The highest-rated is acl-org/acl-anthology at 69/100 with 693 stars. 1 of the top 10 is actively maintained.

Get all 93 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=nlp&subcategory=nlp-dataset-collections&limit=20"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
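
The same request can be issued from a script. The following is a minimal Python sketch, assuming the endpoint returns a JSON body. The helper names `build_url` and `fetch_projects` are hypothetical, and the response schema is not documented here, so the result is returned as parsed JSON without further interpretation. The curl example passes `limit=20`; whether a larger `limit` returns all 93 projects in one response is an assumption to verify against the API.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint and query parameters are taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(limit=20, domain="nlp", subcategory="nlp-dataset-collections"):
    """Build the request URL (hypothetical helper, not part of any official client)."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_projects(limit=20, timeout=30):
    """Fetch the listing and return the parsed JSON body.

    Assumption: the endpoint answers with JSON; its structure is not assumed.
    """
    with urlopen(build_url(limit), timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Print the URL only; call fetch_projects() to perform the request.
    print(build_url())
```

Unauthenticated use counts against the 100 requests/day allowance, so a script like this should cache the response rather than refetch it on every run.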

#	Tool and description	Score (out of 100)	Tier	Stars	Language
1	acl-org/acl-anthology Data and software for building the ACL Anthology.	69	Established	693	Python
2	anoopkunchukuttan/indic_nlp_library Resources and tools for Indian language Natural Language Processing	67	Established	630	Python
3	SudhirGadhvi/open-vernacular-ai-kit Clean Indian code-mixed text before it reaches your LLM.	53	Established	5	Python
4	CLUEbenchmark/CLUECorpus2020 Large-scale Pre-training Corpus for Chinese, 100G	46	Emerging	1,002	—
5	KennethEnevoldsen/scandinavian-embedding-benchmark A Scandinavian Benchmark for sentence embeddings	40	Emerging	46	Python
6	Separius/awesome-sentence-embedding A curated list of pretrained sentence and word embedding models	40	Emerging	2,290	Python
7	AndyTheFactory/romanian-nlp-datasets A list of Romanian NLP Datasets	39	Emerging	56	—
8	banglakit/awesome-bangla A collection of tools, datasets and resources on Bangla computing	38	Emerging	564	—
9	masakhane-io/masakhane-community All our community docs! Start here! Lets put Africa on the NLP Map	37	Emerging	67	—
10	mirfan899/Urdu Collection of Urdu datasets for POS, NER, Sentiment, Summarization and NLP tasks.	37	Emerging	73	—
11	AI4Bharat/Indic-BERT-v1 Indic-BERT-v1: BERT-based Multilingual Model for 11 Indic Languages and...	37	Emerging	291	Python
12	knadh/indic.page A directory of Indic (Indian) language computing resources.	35	Emerging	65	HTML
13	Vikhram-S/IndianConstitution A Python library for exploring the Constitution of India.	35	Emerging	2	Python
14	computerclubkec/constitution-of-nepal-dataset A structured and organized dataset of the Constitution of Nepal in...	35	Emerging	7	—
15	yisaienkov/tinysets The project aims to collect various datasets for tasks such as...	35	Emerging	6	Python
16	Smat26/Roman-Urdu-Dataset Compilation of Manually Tagged Roman Urdu Dataset (Urdu written in...	34	Emerging	34	—
17	dsfsi/masakhane-web Masakhane Web is a translation web application for solely African Languages.	34	Emerging	37	Jupyter Notebook
18	shjwudp/c4-dataset-script Inspired by google c4, here is a series of colossal clean data cleaning...	34	Emerging	135	Python
19	praatibhsurana/Hinglish_Hindi_WSD A pipeline for transliteration, spell correction, POS tagging and word sense...	33	Emerging	37	Python
20	amir9ume/urdu_ghazals_rekhta Dataset for Urdu Ghazals	32	Emerging	20	Jupyter Notebook
21	jcblaisecruz02/Filipino-Text-Benchmarks Open-source benchmark datasets and pretrained transformer models in the...	31	Emerging	64	Python
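22	CLUEbenchmark/CLUEPretrainedModels A collection of high-quality Chinese pre-trained models: state-of-the-art large models, fastest small models, and similarity-specialized models	31	Emerging	816	Python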
23	Vikhram-S/IndianConstitution-js A robust JavaScript library designed to provide seamless access to the...	31	Emerging	1	JavaScript
24	Andrews2017/africanlp-public-datasets A repository for publicly/freely available Natural Language Processing (NLP)...	30	Emerging	114	—
25	uma-pi1/OPIEC Reading the data from OPIEC - an Open Information Extraction corpus	30	Emerging	38	Java
26	federicarollo/Italian-Crime-News A dataset from the Gazzetta di Modena newspaper about crime events in the...	29	Experimental	7	Java
27	cambridgeltl/cometa Corpus of Online Medical EnTities: the cometA corpus	29	Experimental	51	Jupyter Notebook
28	csebuetnlp/banglabert This repository contains the official release of the model "BanglaBERT" and...	29	Experimental	248	Python
29	banglanlp/bnlp-resources Awesome datasets for Bangla language computing.	28	Experimental	64	Python
30	zhanlaoban/NLP_PEMDC NLP Pretrained Embeddings, Models and Datasets Collections (NLP_PEMDC). The...	27	Experimental	65	—
31	UsmanNiazi/DUC-2004-Dataset This Repo Contains the DUC 2004 Dataset	27	Experimental	5	—
32	jacklanda/CCAE [NLPCC 2023] CCAE: A Corpus of Chinese-based Asian Englishes	26	Experimental	57	Python
33	MuhammadYaseenKhan/Urdu-Sentiment-Corpus Labelled Dataset for Urdu Sentiment Analysis	25	Experimental	9	—
34	Sueza-project/Sueza_project Linguistic database collection for the revitalization of Cameroonian local...	24	Experimental	2	HTML
35	anoopkunchukuttan/meteor_indic METEOR for Indian languages (originally forked from METEOR 1.4)	24	Experimental	3	Java
36	crux82/huric HuRIC 2.0 - the Human Robot Interaction Corpus	24	Experimental	17	—
37	s-bose/Walks-into-a-bar-dataset A dataset containing 1000+ walks-into-a-bar jokes scraped from the internet.	24	Experimental	2	Jupyter Notebook
38	mussacharles60/swahili-dictionary Swahili dictionary for implementing in your projects	24	Experimental	4	JavaScript
39	EthioNLP/Resource This repository contains research papers and datasets for different NLP...	23	Experimental	1	—
40	Riccorl/nlp-dataset-readers Readers for NLP Datasets	23	Experimental	3	Python
41	lanwuwei/Twitter-URL-Corpus Large scale sentential paraphrases collection and annotation	23	Experimental	46	HTML
42	mrpeerat/Thai-Sentence-Vector-Benchmark Benchmark for Thai sentence representation	22	Experimental	133	Jupyter Notebook
43	kili-technology/awesome-datasets A comprehensive list of annotated training datasets classified by use case.	22	Experimental	38	—
44	UKPLab/useb Heterogeneous, Task- and Domain-Specific Benchmark for Unsupervised Sentence...	22	Experimental	29	Python
45	hrgupta/indian-scriptures This repository contains various Indian scriptures 📜 in a structured .csv...	22	Experimental	3	Jupyter Notebook
46	COS301-SE-2025/Mafoko Mafoko is a progressive web app (PWA) that provides access to multilingual...	21	Experimental	2	TypeScript
47	reem-codes/ArMATH ArMATH: The Arabic Math Word Problem dataset. Accepted in LREC2022	20	Experimental	10	Python
48	maxent-ai/Datasets datasets with text data for use in NLP, Text analysis, information...	20	Experimental	16	Jupyter Notebook
49	mapmeld/hindi-bert Hindi NLP work	20	Experimental	14	Jupyter Notebook
50	t-systems-on-site-services-gmbh/german-elmo-model This is a german ELMo deep contextualized word representation. It is trained...	20	Experimental	28	—
51	massanishi/hackernews-post-datasets Datasets for hackernews posts	20	Experimental	16	—
52	kassemsabeh/open-brand The dataset contains over 250k product brand-value annotations with more...	20	Experimental	14	Python
53	Hironsan/wiki-article-dataset Wikipedia article dataset	20	Experimental	12	Jupyter Notebook
54	aalok-sathe/sentspace a module to obtain diverse real-world-grounded features for sentences for...	19	Experimental	5	Python
55	Pogayo/Luo-News-Dataset This repo contains LUO corpus for Named Entity Recognition. The text comes...	19	Experimental	7	—
56	hyunwoongko/nlp-datasets Curation note of NLP datasets	19	Experimental	98	—
57	SuzanaK/language_datasets Language Datasets for NLP, Machine Learning, and Map Creation	19	Experimental	6	—
58	NetworkTheoryAppliedResearchInstitute/anthropology- Comprehensive AI training corpus for anthropology education: 580K tokens...	19	Experimental	—	—
59	VLa-Labs/Danish-Language-Dataset-List A curated metadata collection of 31 publicly available Danish language datasets.	18	Experimental	3	—
60	quality-attributes/datasets Official data sources for the Quality Attributes project	18	Experimental	6	Jupyter Notebook
61	OumaimaHourrane/MA_Open_Datasets Moroccan NLP Datasets and Corpora	16	Experimental	3	Jupyter Notebook
62	nlp-waseda/comet-atomic-ja COMET-ATOMIC ja	16	Experimental	31	Python
63	filbench/filbench-eval Experiments and Analyses for FilBench: An Open LLM Leaderboard for Filipino...	16	Experimental	9	Python
64	megagonlabs/ebe-dataset Evidence-based Explanation Dataset (AACL-IJCNLP 2020)	15	Experimental	18	PLSQL
65	mohansaidinesh/Datasets Datasets for Machine Learning	14	Experimental	4	Python
66	ART-Group-it/GASP GASP! Dataset - Generating Abstracts of Scientific Papers from Abstracts of...	14	Experimental	9	—
67	pln-fing-udelar/humor HUMOR dataset for humor research	14	Experimental	7	HTML
68	davidwarrior22/machine-translation-for-african-languages This repository focuses on developing machine translation and NLP tools...	14	Experimental	—	TeX
69	aviaefrat/cryptonite The Official Repository of the Cryptonite Dataset	14	Experimental	23	Python
70	viperx-20/awesome-sentence-embedding A curated list of pretrained sentence and word embedding models	13	Experimental	5	Python
71	kaisugi/datasets-for-sequential-sentence-classification Curated list of public datasets which focus on sentence classification in...	13	Experimental	5	—
72	Niger-Volta-LTI/urhobo-text Urhobo language training text for NLP, ASR and TTS tasks	13	Experimental	6	—
73	jahidulzaid/BanglaNostalgia A benchmark and training pipeline for detecting nostalgia in Bangla text....	13	Experimental	2	Python
74	dsfsi/project-state-capture Zondo Commission or State Capture Commission Transcripts	12	Experimental	3	—
75	mzmmoazam/kashmiri_dataset Data and tool to fetch kashmiri text	12	Experimental	16	HTML
76	createmomo/supporting-comedy-writers Predicting Audience’s Response from Sketch Comedy and Crosstalk Scripts (A...	12	Experimental	3	—
77	bluechoochoo/retired_comedy_phrases A Casual Spreadsheets resource	12	Experimental	13	—
78	Archaeocomputers/Bessarion A text and imaging dataset of Byzantine-era Medieval Greek inscriptions.	12	Experimental	4	Python
79	jonas-becker/pd-human-vs-machine-content The official repository for the paper "Paraphrase Detection: Human vs....	12	Experimental	3	HTML
80	slvnwhrl/sigmorphon2022-models This repository contains the models used by the CLUZH team for the...	12	Experimental	3	Python
81	CyberAgentAILab/AdParaphrase This repository contains data for our paper "AdParaphrase: Paraphrase...	12	Experimental	1	—
82	ICPSR/dataset-references NER pipeline to detect dataset references for ASIST 2022 paper	12	Experimental	3	Jupyter Notebook
83	KushtrimVisoka/Kosovo-Parliament-Transcriptions NOTE: The dataset is maintained exclusively on HuggingFace Datasets. The...	12	Experimental	3	Jupyter Notebook
84	OpenCENIA/SRN Spanish Resources and Evaluation	12	Experimental	3	—
85	dsfsi/PuoData Curated corpora for Setswana. Used to train PuoBERTa.	12	Experimental	3	—
86	radi-cho/noisy-sentences-dataset 550K sentences in 5 European languages augmented with noise for training and...	11	Experimental	2	—
87	NoelShallum/all-indian-acts Repository containing all Indian Acts and statutes in the PDF and txt...	11	Experimental	2	—
88	rmdodhia/dataset-detection Detects datasets used in journal papers	11	Experimental	2	Python
89	dsfsi/zabantu-beta ZaBantu is a fleet of light-weight Masked Language Models for Southern Bantu...	11	Experimental	2	Python
90	MusfiqDehan/bn-en-aligner Tool to easily align Bangla and English words from sentences	11	Experimental	—	JavaScript
91	BrianMsane/siSwati-Datasets Repository for siSwati NLP datasets which I have worked on in my research....	10	Experimental	1	—
92	felixgiov/public-meeting Dataset from the paper "Information Extraction from Public Meeting Articles"	10	Experimental	1	—
93	metriccoders/metriccoders_datasets This is the Metric Coders repository containing all the datasets for machine...	10	Experimental	1	—
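
For working with the table offline, the rows can be parsed into records. The following is a rough Python sketch that assumes each row is tab-separated with the six columns of the header, and that the repository name and its description are separated by the first space. The helpers `parse_row` and `tier_for` are hypothetical. The tier cut-offs are inferred from the table (scores above 50 are Established, 30 is the lowest Emerging score, 29 the highest Experimental one) and are not a documented rule.

```python
def parse_row(line):
    """Parse one tab-separated table row into a dict (assumed layout, see above)."""
    rank, tool, score, tier, stars, language = line.rstrip("\n").split("\t")
    # The tool cell holds "owner/repo" followed by a free-text description.
    repo, _, description = tool.partition(" ")
    return {
        "rank": int(rank),
        "repo": repo,
        "description": description,
        "score": int(score),
        "tier": tier,
        # A dash placeholder means the value is missing in the listing.
        "stars": None if stars == "—" else int(stars.replace(",", "")),
        "language": None if language == "—" else language,
    }


def tier_for(score):
    """Map a score to a tier using thresholds inferred from the table (assumption)."""
    if score > 50:
        return "Established"
    if score >= 30:
        return "Emerging"
    return "Experimental"


if __name__ == "__main__":
    sample = "7\tAndyTheFactory/romanian-nlp-datasets A list of Romanian NLP Datasets\t39\tEmerging\t56\t—"
    row = parse_row(sample)
    print(row["repo"], row["score"], tier_for(row["score"]))
```

Comparing `tier_for(row["score"])` with the listed `tier` for every row is a quick way to check whether the inferred thresholds hold across the full listing.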