NLP Corpus Datasets NLP Tools

Curated collections, loaders, and databases of text corpora for NLP research and training. Includes corpus compilation tools, domain-specific annotated datasets, and corpus management systems. Does NOT include tools for corpus analysis, linguistic annotation frameworks, or applications built on top of corpora.

There are 74 NLP corpus dataset tools tracked. 3 score 50 or above (Established tier). The highest-rated is Helsinki-NLP/OpusFilter at 64/100 with 115 stars and 532 monthly downloads.

Get all 74 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=nlp&subcategory=nlp-corpus-datasets&limit=20"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
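The same request can be made from a script. The following is a minimal Python sketch using only the standard library; it assumes the endpoint returns a JSON body, as the heading above suggests. The function names `build_url` and `fetch_projects` are illustrative and not part of any published client, and the response schema is not documented here, so the result is returned unparsed beyond JSON decoding. The default `limit=20` mirrors the curl command; whether the API accepts larger values or supports paging is not stated in the source.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken verbatim from the curl command above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(limit=20):
    """Build the query URL; `limit` defaults to the value used in the curl example."""
    params = {
        "domain": "nlp",
        "subcategory": "nlp-corpus-datasets",
        "limit": limit,
    }
    return f"{BASE_URL}?{urlencode(params)}"


def fetch_projects(limit=20, timeout=30):
    """Fetch and decode the JSON response (assumption: the body is valid JSON)."""
    with urlopen(build_url(limit), timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Performs a real network request; counts against the daily request quota.
    data = fetch_projects()
    print(json.dumps(data, indent=2)[:500])
```

Keeping URL construction separate from the network call makes the query parameters easy to inspect or change without sending a request.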

# Tool Score Tier
1 Helsinki-NLP/OpusFilter

OpusFilter - Parallel corpus processing toolkit

64
Established
2 natasha/corus

Links to Russian corpora + Python functions for loading and parsing

58
Established
3 SergeyShk/ruTS

A library for extracting statistics from Russian-language texts.

50
Established
4 natasha/nerus

Large silver-standard Russian corpus with NER, morphology and syntax markup

48
Emerging
5 darija-open-dataset/dataset

darija <-> english dataset

45
Emerging
6 omicsNLP/Auto-CORPus

Auto-CORPus pipeline developed by a University of Nottingham and Imperial...

45
Emerging
7 texttechnologylab/UCE

The Unified Corpus Explorer (UCE) for UIMA-annotated Corpora.

45
Emerging
8 fido-ai/ua-datasets

A collection of datasets for Ukrainian language

44
Emerging
9 texttechnologylab/GerParCor

German Parliamentary Corpus (GerParCor)

43
Emerging
10 Koziev/NLP_Datasets

My NLP datasets for Russian language

38
Emerging
11 bureaucratic-labs/dostoevsky

Sentiment analysis library for the Russian language

36
Emerging
12 M4t1ss/parallel-corpora-tools

Tools for filtering and cleaning parallel and monolingual corpora for...

34
Emerging
13 JuliaText/CorpusLoaders.jl

A variety of loaders for various NLP corpora.

34
Emerging
14 notesjor/corpusexplorer2.0

Corpus linguistics has never been so easy...

33
Emerging
15 ericleasemorgan/reader

Distant Reader, a tool for using & understanding a corpus

31
Emerging
16 JonathanReeve/corpus-db

A textual corpus database for the digital humanities.

31
Emerging
17 rashiedomar/somali-wikipedia-corpus

Cleaned Somali Wikipedia corpus (~9,500 articles) for NLP, LLM training, and...

31
Emerging
18 adbar/German-NLP

Curated list of open-access/open-source/off-the-shelf resources and tools...

30
Emerging
19 microsoft/Clandestino

Repository for the Clandestino corpus

30
Emerging
20 josecannete/spanish-corpora

Unannotated Spanish 3 Billion Words Corpora

30
Emerging
21 yutkin/Lenta.Ru-News-Dataset

Corpus of Russian news articles collected from Lenta.Ru

29
Experimental
22 maxoodf/russian_news_corpus

Russian mass media stemmed texts corpus / Corpus of lemmatized...

29
Experimental
23 t-systems-on-site-services-gmbh/german-wikipedia-text-corpus

This is a German text corpus from Wikipedia. It is cleaned, preprocessed and...

29
Experimental
24 KurdishBLARK/InterdialectCorpus

A parallel corpus of Sorani, Kurmanji and English

29
Experimental
25 practikpharma/PGxCorpus

PGxCorpus, a manually annotated corpus, designed for the extraction of...

28
Experimental
26 ilinguistics/corpus_similarity

Measure the similarity of text corpora for 74 languages

28
Experimental
27 velkadamban/Tamil-Corpus

This Tamil project aims to create a comprehensive and high-quality...

26
Experimental
28 madhav1k/OpenCorpus

A multilingual compilation of open-source textual corpora across major &...

26
Experimental
29 ajithalbus/TamilCorpus

Open Source Tamil Corpus of 58M words

26
Experimental
30 microsoft/BrevE-CLaro

Repository for the BrevE and CLaro datasets.

24
Experimental
31 davide-ghidelli-business/OpenCorpus

OpenCorpus is a collection of open-source textual corpora from various...

23
Experimental
32 somosnlp/corpus-es

List of Spanish-language NLP corpora ✨ #Somos600M: Help develop AI...

23
Experimental
33 Digital-Pushkin-Lab/RuAdapt_Word_Lists

Word alignments from Russian-Simple Russian parallel data

23
Experimental
34 SpydazWebAI-NLP/BasicCorpus2023

A basic corpus object, giving positional encoding/decoding. A fully...

23
Experimental
35 notesjor/CorpusExplorer.Terminal.Console

Allows other programs/programming languages to access...

23
Experimental
36 ilinguistics/earthLings

Corpus-based language and dialect mapping

23
Experimental
37 stdlib-js/datasets-moby-dick

The text of Moby Dick by Herman Melville.

22
Experimental
38 juwiragiye/ikirundi

The Ikirundi Corpus Project aims to create a comprehensive collection of...

22
Experimental
39 kscanne/chichewa

NLP resources for Chichewa

21
Experimental
40 d0rj/RusLit

📚 A small collection of Russian literature 📚

21
Experimental
41 kateryna-bobrovnyk/ukr-twi-corpus

A corpus of Ukrainian Twitter texts + instructions for downloading and...

21
Experimental
42 Kartikaggarwal98/Indian_ParallelCorpus

Curated list of publicly available parallel corpora for Indian languages

20
Experimental
43 DFKI-NLP/product-corpus

This repository contains the DFKI Product Corpus, a dataset of 174 documents...

20
Experimental
44 AlexKly/Detailed-NER-Dataset-RU

Labeled Russian text token-by-token for training models for NER task based...

19
Experimental
45 NLP-UMUTeam/Spanish-MisoCorpus-2020

Spanish MisoCorpus 2020

19
Experimental
46 gambolputty/textstelle

Textstelle is a collection of corpora for the creation of bots and other...

18
Experimental
47 Digital-Pushkin-Lab/RuAdapt

A Parallel Russian-Simple Russian Dataset

17
Experimental
48 SaiedAlshahrani/performance-implications

Performance Implications of Using Unrepresentative Corpora in Arabic Natural...

16
Experimental
49 karen-pal/borges

Datasets of short-story texts by various Latin American authors...

16
Experimental
50 DOLMA-NLP/PARME

Parallel corpora for Middle Eastern languages - ACL2025

15
Experimental
51 AsoSoft/AsoSoft-Text-Corpus

AsoSoft Text Corpus is the first large scale text corpus for the Kurdish language.

15
Experimental
52 derintelligence/en-az-parallel-corpus

English-Azerbaijani parallel language corpus

15
Experimental
53 josealzugaray/cayetana-corpus

NLP analysis of 107 political speech transcripts (TEI-XML corpus) — topic...

14
Experimental
54 mannefedov/ru_kw_eval_datasets

Datasets for evaluation of keyword extraction in Russian

14
Experimental
55 steventan0110/align-filter

Repository for "Bitext Mining for Low-Resource Languages via Contrastive Learning"

13
Experimental
56 KurdishBLARK/KTC

Kurdish Textbooks Corpus

13
Experimental
57 madrugado/gia-corpus

Corpus of exam tests for 9-graders in Russia for NLP/ML purposes

13
Experimental
58 mideind/GreynirCorpus

A large treebank of parsed Icelandic text

13
Experimental
59 ixa-ehu/cometa

Website of the CoMeta, a Corpus for Metaphor Detection in Spanish

12
Experimental
60 CoffeBank/Ru-hard-detection-dataset

Ru AI-text detection dataset / Russian-language dataset for evaluating detection...

12
Experimental
61 Ofis-publik-ar-brezhoneg/breton-french-corpus

Bilingual Breton-French corpus

12
Experimental
62 lirondos/coalas

COrpus of AngLicisms in the SpAnish PresS (COALAS) 🐨

12
Experimental
63 SaiedAlshahrani/Wikipedia-Corpora-Report

Wikipedia Corpora Meta Report: A Metadata Report of How Wikipedia Editions...

11
Experimental
64 Tentakl3/Spanish-NLP-preprocessing

Customized tokenization and preprocessing of Natural Language in Spanish -...

11
Experimental
65 TianciGao/RussScholar-Seeker

RussScholar-Seeker: A Python package for predicting whether a name is Russian...

11
Experimental
66 CesarJNP/Depression-corpus-spanish

Spanish depression-labeled corpus (0/1)

11
Experimental
67 techiaith/corpws-meincnodi-rhannau-ymadrodd

A corpus for benchmarking Welsh part-of-speech taggers

11
Experimental
68 peghaz/corpora-intersector

High-performance tool to find corpus words missing from large texts using...

11
Experimental
69 Uyghur-Corpus/Uyghur-Corpus

Large-scale Uyghur corpus optimized for Large Language Models (LLMs) and NLP...

11
Experimental
70 DusunDictionary/dusun-english-malay-corpus

A linguistic corpus of the Dusun language for NMT and LLM training

11
Experimental
71 BrightXiaoHan/Yitextor

Parallel corpus processing toolkit forked from...

11
Experimental
72 mosesab/Corpus-based-synonym-finder

Finds the synonym of words in a language using a language corpus

10
Experimental
73 progmatix21/Chilka

A corpus server library with a document database backend.

10
Experimental
74 hassanzadehmahdi/BioPersianWikiAnalyzer

Persian Wikipedia Bioinformatics Page Crawler and Text Preprocessor

10
Experimental
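The tier labels in the listing above appear to follow fixed score bands. The sketch below encodes the bands as they can be inferred from the listed rows only: 50 is the lowest score labeled Established, 30 the lowest labeled Emerging, and 29 the highest labeled Experimental. The exact cutoffs are an assumption (no listed project scores 49, and no score below 10 appears), and `tier_for_score` is a hypothetical helper, not part of the scoring service.

```python
def tier_for_score(score: int) -> str:
    """Map a 0-100 quality score to a tier label.

    Cutoffs are inferred from the listing, not documented by the source:
    scores of 50 and above are shown as Established, 30-48 as Emerging,
    and 10-29 as Experimental.
    """
    if score >= 50:
        return "Established"
    if score >= 30:
        return "Emerging"
    return "Experimental"


# Spot checks against rows that appear in the listing.
for name, score in [
    ("Helsinki-NLP/OpusFilter", 64),
    ("SergeyShk/ruTS", 50),
    ("natasha/nerus", 48),
    ("josecannete/spanish-corpora", 30),
    ("yutkin/Lenta.Ru-News-Dataset", 29),
]:
    print(f"{name}: {score} -> {tier_for_score(score)}")
```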