NLP Dataset Collections NLP Tools

Curated lists, catalogs, and repositories of NLP datasets organized by language, task, or domain. Does NOT include individual datasets, dataset creation tools, or data annotation platforms.

There are 93 NLP dataset collection tools tracked. Of these, 3 score above 50 (the Established tier). The highest-rated is acl-org/acl-anthology at 69/100 with 693 stars. Only 1 of the top 10 is actively maintained.

Get the projects as JSON (the example request below passes limit=20, so a single call may return fewer than all 93)

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=nlp&subcategory=nlp-dataset-collections&limit=20"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
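The same request can be made from Python. The sketch below uses only the standard library and reuses the URL and query parameters from the curl example above. The helper names `build_url` and `fetch_projects` are hypothetical, and because the response schema is not described on this page, the parsed JSON is returned as-is rather than indexed by assumed field names.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="nlp", subcategory="nlp-dataset-collections", limit=20):
    """Build the request URL with the same query parameters as the curl example."""
    query = urllib.parse.urlencode(
        {"domain": domain, "subcategory": subcategory, "limit": limit}
    )
    return f"{BASE_URL}?{query}"


def fetch_projects(url, timeout=30):
    """Send a GET request and decode the response body as JSON.

    The structure of the response is an unknown here, so the parsed
    value is returned unchanged for the caller to inspect.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_projects(build_url())` performs the request; each call counts against the daily request limit mentioned above.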

Each entry below lists rank and tool, description, score (out of 100), and tier.
1 acl-org/acl-anthology

Data and software for building the ACL Anthology.

69
Established
2 anoopkunchukuttan/indic_nlp_library

Resources and tools for Indian language Natural Language Processing

67
Established
3 SudhirGadhvi/open-vernacular-ai-kit

Clean Indian code-mixed text before it reaches your LLM.

53
Established
4 CLUEbenchmark/CLUECorpus2020

Large-scale Pre-training Corpus for Chinese: a 100G Chinese pre-training corpus

46
Emerging
5 KennethEnevoldsen/scandinavian-embedding-benchmark

A Scandinavian Benchmark for sentence embeddings

40
Emerging
6 Separius/awesome-sentence-embedding

A curated list of pretrained sentence and word embedding models

40
Emerging
7 AndyTheFactory/romanian-nlp-datasets

A list of Romanian NLP Datasets

39
Emerging
8 banglakit/awesome-bangla

A collection of tools, datasets and resources on Bangla computing

38
Emerging
9 masakhane-io/masakhane-community

All our community docs! Start here! Let's put Africa on the NLP map

37
Emerging
10 mirfan899/Urdu

Collection of Urdu datasets for POS, NER, Sentiment, Summarization and NLP tasks.

37
Emerging
11 AI4Bharat/Indic-BERT-v1

Indic-BERT-v1: BERT-based Multilingual Model for 11 Indic Languages and...

37
Emerging
12 knadh/indic.page

A directory of Indic (Indian) language computing resources.

35
Emerging
13 Vikhram-S/IndianConstitution

A Python library for exploring the Constitution of India.

35
Emerging
14 computerclubkec/constitution-of-nepal-dataset

A structured and organized dataset of the Constitution of Nepal in...

35
Emerging
15 yisaienkov/tinysets

The project aims to collect various datasets for tasks such as...

35
Emerging
16 Smat26/Roman-Urdu-Dataset

Compilation of Manually Tagged Roman Urdu Dataset (Urdu written in...

34
Emerging
17 dsfsi/masakhane-web

Masakhane Web is a translation web application solely for African languages.

34
Emerging
18 shjwudp/c4-dataset-script

Inspired by Google C4, here is a series of colossal clean data cleaning...

34
Emerging
19 praatibhsurana/Hinglish_Hindi_WSD

A pipeline for transliteration, spell correction, POS tagging and word sense...

33
Emerging
20 amir9ume/urdu_ghazals_rekhta

Dataset for Urdu Ghazals

32
Emerging
21 jcblaisecruz02/Filipino-Text-Benchmarks

Open-source benchmark datasets and pretrained transformer models in the...

31
Emerging
22 CLUEbenchmark/CLUEPretrainedModels

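A collection of high-quality Chinese pre-trained models: state-of-the-art large models, the fastest small models, and specialized similarity models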

31
Emerging
23 Vikhram-S/IndianConstitution-js

A robust JavaScript library designed to provide seamless access to the...

31
Emerging
24 Andrews2017/africanlp-public-datasets

A repository for publicly/freely available Natural Language Processing (NLP)...

30
Emerging
25 uma-pi1/OPIEC

Reading the data from OPIEC - an Open Information Extraction corpus

30
Emerging
26 federicarollo/Italian-Crime-News

A dataset from the Gazzetta di Modena newspaper about crime events in the...

29
Experimental
27 cambridgeltl/cometa

Corpus of Online Medical EnTities: the cometA corpus

29
Experimental
28 csebuetnlp/banglabert

This repository contains the official release of the model "BanglaBERT" and...

29
Experimental
29 banglanlp/bnlp-resources

Awesome datasets for Bangla language computing.

28
Experimental
30 zhanlaoban/NLP_PEMDC

NLP Pretrained Embeddings, Models and Datasets Collections (NLP_PEMDC). The...

27
Experimental
31 UsmanNiazi/DUC-2004-Dataset

This Repo Contains the DUC 2004 Dataset

27
Experimental
32 jacklanda/CCAE

[NLPCC 2023] CCAE: A Corpus of Chinese-based Asian Englishes

26
Experimental
33 MuhammadYaseenKhan/Urdu-Sentiment-Corpus

Labelled Dataset for Urdu Sentiment Analysis

25
Experimental
34 Sueza-project/Sueza_project

Linguistic database collection for the revitalization of Cameroonian local...

24
Experimental
35 anoopkunchukuttan/meteor_indic

METEOR for Indian languages (originally forked from METEOR 1.4)

24
Experimental
36 crux82/huric

HuRIC 2.0 - the Human Robot Interaction Corpus

24
Experimental
37 s-bose/Walks-into-a-bar-dataset

A dataset containing 1000+ walks-into-a-bar jokes scraped from the internet.

24
Experimental
38 mussacharles60/swahili-dictionary

Swahili dictionary for implementing in your projects

24
Experimental
39 EthioNLP/Resource

This repository contains research papers and datasets for different NLP...

23
Experimental
40 Riccorl/nlp-dataset-readers

Readers for NLP Datasets

23
Experimental
41 lanwuwei/Twitter-URL-Corpus

Large scale sentential paraphrases collection and annotation

23
Experimental
42 mrpeerat/Thai-Sentence-Vector-Benchmark

Benchmark for Thai sentence representation

22
Experimental
43 kili-technology/awesome-datasets

A comprehensive list of annotated training datasets classified by use case.

22
Experimental
44 UKPLab/useb

Heterogeneous, Task- and Domain-Specific Benchmark for Unsupervised Sentence...

22
Experimental
45 hrgupta/indian-scriptures

This repository contains various Indian scriptures 📜 in a structured .csv...

22
Experimental
46 COS301-SE-2025/Mafoko

Mafoko is a progressive web app (PWA) that provides access to multilingual...

21
Experimental
47 reem-codes/ArMATH

ArMATH: The Arabic Math Word Problem dataset. Accepted in LREC2022

20
Experimental
48 maxent-ai/Datasets

datasets with text data for use in NLP, Text analysis, information...

20
Experimental
49 mapmeld/hindi-bert

Hindi NLP work

20
Experimental
50 t-systems-on-site-services-gmbh/german-elmo-model

This is a German ELMo deep contextualized word representation. It is trained...

20
Experimental
51 massanishi/hackernews-post-datasets

Datasets for hackernews posts

20
Experimental
52 kassemsabeh/open-brand

The dataset contains over 250k product brand-value annotations with more...

20
Experimental
53 Hironsan/wiki-article-dataset

Wikipedia article dataset

20
Experimental
54 aalok-sathe/sentspace

a module to obtain diverse real-world-grounded features for sentences for...

19
Experimental
55 Pogayo/Luo-News-Dataset

This repo contains LUO corpus for Named Entity Recognition. The text comes...

19
Experimental
56 hyunwoongko/nlp-datasets

Curation note of NLP datasets

19
Experimental
57 SuzanaK/language_datasets

Language Datasets for NLP, Machine Learning, and Map Creation

19
Experimental
58 NetworkTheoryAppliedResearchInstitute/anthropology-

Comprehensive AI training corpus for anthropology education: 580K tokens...

19
Experimental
59 VLa-Labs/Danish-Language-Dataset-List

A curated metadata collection of 31 publicly available Danish language datasets.

18
Experimental
60 quality-attributes/datasets

Official data sources for the Quality Attributes project

18
Experimental
61 OumaimaHourrane/MA_Open_Datasets

Moroccan NLP Datasets and Corpora

16
Experimental
62 nlp-waseda/comet-atomic-ja

COMET-ATOMIC ja

16
Experimental
63 filbench/filbench-eval

Experiments and Analyses for FilBench: An Open LLM Leaderboard for Filipino...

16
Experimental
64 megagonlabs/ebe-dataset

Evidence-based Explanation Dataset (AACL-IJCNLP 2020)

15
Experimental
65 mohansaidinesh/Datasets

Datasets for Machine Learning

14
Experimental
66 ART-Group-it/GASP

GASP! Dataset - Generating Abstracts of Scientific Papers from Abstracts of...

14
Experimental
67 pln-fing-udelar/humor

HUMOR dataset for humor research

14
Experimental
68 davidwarrior22/machine-translation-for-african-languages

This repository focuses on developing machine translation and NLP tools...

14
Experimental
69 aviaefrat/cryptonite

The Official Repository of the Cryptonite Dataset

14
Experimental
70 viperx-20/awesome-sentence-embedding

A curated list of pretrained sentence and word embedding models

13
Experimental
71 kaisugi/datasets-for-sequential-sentence-classification

Curated list of public datasets which focus on sentence classification in...

13
Experimental
72 Niger-Volta-LTI/urhobo-text

Urhobo language training text for NLP, ASR and TTS tasks

13
Experimental
73 jahidulzaid/BanglaNostalgia

A benchmark and training pipeline for detecting nostalgia in Bangla text....

13
Experimental
74 dsfsi/project-state-capture

Zondo Commission or State Capture Commission Transcripts

12
Experimental
75 mzmmoazam/kashmiri_dataset

Data and tool to fetch Kashmiri text

12
Experimental
76 createmomo/supporting-comedy-writers

Predicting Audience’s Response from Sketch Comedy and Crosstalk Scripts (A...

12
Experimental
77 bluechoochoo/retired_comedy_phrases

A Casual Spreadsheets resource

12
Experimental
78 Archaeocomputers/Bessarion

A text and imaging dataset of Byzantine-era Medieval Greek inscriptions.

12
Experimental
79 jonas-becker/pd-human-vs-machine-content

The official repository for the paper "Paraphrase Detection: Human vs....

12
Experimental
80 slvnwhrl/sigmorphon2022-models

This repository contains the models used by the CLUZH team for the...

12
Experimental
81 CyberAgentAILab/AdParaphrase

This repository contains data for our paper "AdParaphrase: Paraphrase...

12
Experimental
82 ICPSR/dataset-references

NER pipeline to detect dataset references for ASIST 2022 paper

12
Experimental
83 KushtrimVisoka/Kosovo-Parliament-Transcriptions

NOTE: The dataset is maintained exclusively on HuggingFace Datasets. The...

12
Experimental
84 OpenCENIA/SRN

Spanish Resources and Evaluation

12
Experimental
85 dsfsi/PuoData

Curated corpora for Setswana. Used to train PuoBERTa.

12
Experimental
86 radi-cho/noisy-sentences-dataset

550K sentences in 5 European languages augmented with noise for training and...

11
Experimental
87 NoelShallum/all-indian-acts

Repository containing all Indian Acts and statutes in the PDF and txt...

11
Experimental
88 rmdodhia/dataset-detection

Detects datasets used in journal papers

11
Experimental
89 dsfsi/zabantu-beta

ZaBantu is a fleet of light-weight Masked Language Models for Southern Bantu...

11
Experimental
90 MusfiqDehan/bn-en-aligner

Tool to easily align Bangla and English words from sentences

11
Experimental
91 BrianMsane/siSwati-Datasets

Repository for siSwati NLP datasets which I have worked on in my research....

10
Experimental
92 felixgiov/public-meeting

Dataset from the paper "Information Extraction from Public Meeting Articles"

10
Experimental
93 metriccoders/metriccoders_datasets

This is the Metric Coders repository containing all the datasets for machine...

10
Experimental
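
The tier labels in the listing above appear to follow score cut-offs. The Python sketch below encodes the boundaries that can be inferred from the entries shown: scores of 53 and up are labelled Established (the summary says "above 50"), 30 to 46 Emerging, and 29 and below Experimental. These thresholds are assumptions read off this page, not a documented scoring rule. How a score of exactly 50, or any score between 47 and 52, is treated is a guess, and tiers not represented in this listing are not modelled. The function name `tier_for_score` is hypothetical.

```python
def tier_for_score(score: int) -> str:
    """Map a 0-100 score to a tier label using cut-offs inferred from the listing.

    Assumed boundaries (not an official rule):
      score > 50   -> "Established"  (summary text says "above 50")
      score >= 30  -> "Emerging"     (30 is Emerging, 29 is Experimental in the listing)
      otherwise    -> "Experimental"
    """
    if score > 50:
        return "Established"
    if score >= 30:
        return "Emerging"
    return "Experimental"


# A few (tool, score, tier) rows copied from the listing, used as a consistency check.
SAMPLE_ROWS = [
    ("acl-org/acl-anthology", 69, "Established"),
    ("SudhirGadhvi/open-vernacular-ai-kit", 53, "Established"),
    ("CLUEbenchmark/CLUECorpus2020", 46, "Emerging"),
    ("uma-pi1/OPIEC", 30, "Emerging"),
    ("federicarollo/Italian-Crime-News", 29, "Experimental"),
    ("metriccoders/metriccoders_datasets", 10, "Experimental"),
]

# Collect any sampled rows whose listed tier disagrees with the inferred rule.
mismatches = [
    (name, score, tier)
    for name, score, tier in SAMPLE_ROWS
    if tier_for_score(score) != tier
]
```

For the sampled rows, `mismatches` comes out empty, which is consistent with the inferred cut-offs but does not confirm them for scores that do not occur in the listing.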