NLP Corpus Datasets NLP Tools
Curated collections, loaders, and databases of text corpora for NLP research and training. Includes corpus compilation tools, domain-specific annotated datasets, and corpus management systems. Does NOT include tools for corpus analysis, linguistic annotation frameworks, or applications built on top of corpora.
74 NLP corpus dataset tools are tracked. Three score above 50 (Established tier). The highest-rated is Helsinki-NLP/OpusFilter at 64/100, with 115 stars and 532 monthly downloads.
Get the 74 projects as JSON (the example request sets limit=20):
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=nlp&subcategory=nlp-corpus-datasets&limit=20"
Open to everyone: 100 requests/day, no key needed. A free key raises the limit to 1,000/day.
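The same request can be sketched in Python using only the standard library. This is a minimal sketch, not an official client: the function names `build_quality_url` and `fetch_projects` are hypothetical, the query parameters are copied from the curl example above, and the shape of the JSON response is not documented here, so the code returns the decoded body without assuming any keys.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_quality_url(domain="nlp", subcategory="nlp-corpus-datasets", limit=20):
    """Build the request URL with the same query parameters as the curl example."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_projects(url, timeout=10):
    """Fetch the URL and decode the body as JSON.

    Assumption: the endpoint returns a JSON body. Its schema is not
    described in this page, so the decoded value is returned unchanged.
    """
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Prints the URL only; call fetch_projects(url) to perform the request.
    print(build_quality_url())
```

Whether a larger `limit` value returns all 74 projects in one response is an assumption to verify against the API; keyless access is limited to 100 requests/day.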
| # | Tool | Description | Score | Tier |
|---|---|---|---|---|
| 1 | Helsinki-NLP/OpusFilter | OpusFilter - Parallel corpus processing toolkit | 64 | Established |
| 2 | natasha/corus | Links to Russian corpora + Python functions for loading and parsing | | Established |
| 3 | SergeyShk/ruTS | A library for extracting statistics from Russian-language texts. | | Established |
| 4 | natasha/nerus | Large silver-standard Russian corpus with NER, morphology and syntax markup | | Emerging |
| 5 | darija-open-dataset/dataset | Darija <-> English dataset | | Emerging |
| 6 | omicsNLP/Auto-CORPus | Auto-CORPus pipeline developed by a University of Nottingham and Imperial... | | Emerging |
| 7 | texttechnologylab/UCE | The Unified Corpus Explorer (UCE) for UIMA-annotated Corpora. | | Emerging |
| 8 | fido-ai/ua-datasets | A collection of datasets for Ukrainian language | | Emerging |
| 9 | texttechnologylab/GerParCor | German Parliamentary Corpus (GerParCor) | | Emerging |
| 10 | Koziev/NLP_Datasets | My NLP datasets for Russian language | | Emerging |
| 11 | bureaucratic-labs/dostoevsky | Sentiment analysis library for Russian language | | Emerging |
| 12 | M4t1ss/parallel-corpora-tools | Tools for filtering and cleaning parallel and monolingual corpora for... | | Emerging |
| 13 | JuliaText/CorpusLoaders.jl | A variety of loaders for various NLP corpora. | | Emerging |
| 14 | notesjor/corpusexplorer2.0 | Corpus linguistics has never been so easy... | | Emerging |
| 15 | ericleasemorgan/reader | Distant Reader, a tool for using & understanding a corpus | | Emerging |
| 16 | JonathanReeve/corpus-db | A textual corpus database for the digital humanities. | | Emerging |
| 17 | rashiedomar/somali-wikipedia-corpus | Cleaned Somali Wikipedia corpus (~9,500 articles) for NLP, LLM training, and... | | Emerging |
| 18 | adbar/German-NLP | Curated list of open-access/open-source/off-the-shelf resources and tools... | | Emerging |
| 19 | microsoft/Clandestino | Repository for the Clandestino corpus | | Emerging |
| 20 | josecannete/spanish-corpora | Unannotated Spanish 3 Billion Words Corpora | | Emerging |
| 21 | yutkin/Lenta.Ru-News-Dataset | Corpus of Russian news articles collected from Lenta.Ru | | Experimental |
| 22 | maxoodf/russian_news_corpus | Russian mass media stemmed texts corpus / Corpus of lemmatized... | | Experimental |
| 23 | t-systems-on-site-services-gmbh/german-wikipedia-text-corpus | This is a German text corpus from Wikipedia. It is cleaned, preprocessed and... | | Experimental |
| 24 | KurdishBLARK/InterdialectCorpus | A parallel corpus of Sorani, Kurmanji and English | | Experimental |
| 25 | practikpharma/PGxCorpus | PGxCorpus, a manually annotated corpus, designed for the extraction of... | | Experimental |
| 26 | ilinguistics/corpus_similarity | Measure the similarity of text corpora for 74 languages | | Experimental |
| 27 | velkadamban/Tamil-Corpus | This nTamil project aims to create a comprehensive and high-quality... | | Experimental |
| 28 | madhav1k/OpenCorpus | A multilingual compilation of open-source textual corpora across major &... | | Experimental |
| 29 | ajithalbus/TamilCorpus | Open Source Tamil Corpus of 58M words | | Experimental |
| 30 | microsoft/BrevE-CLaro | Repository for the BrevE and CLaro datasets. | | Experimental |
| 31 | davide-ghidelli-business/OpenCorpus | OpenCorpus is a collection of open-source textual corpora from various... | | Experimental |
| 32 | somosnlp/corpus-es | List of NLP corpora in Spanish ✨ #Somos600M: Help develop AI... | | Experimental |
| 33 | Digital-Pushkin-Lab/RuAdapt_Word_Lists | Word alignments from Russian-Simple Russian parallel data | | Experimental |
| 34 | SpydazWebAI-NLP/BasicCorpus2023 | A Basic Corpus Object, Giving Positional Encoding / Decoding. A Fully... | | Experimental |
| 35 | notesjor/CorpusExplorer.Terminal.Console | Allows other programs/programming languages to access... | | Experimental |
| 36 | ilinguistics/earthLings | Corpus-based language and dialect mapping | | Experimental |
| 37 | stdlib-js/datasets-moby-dick | The text of Moby Dick by Herman Melville. | | Experimental |
| 38 | juwiragiye/ikirundi | The Ikirundi Corpus Project aims to create a comprehensive collection of... | | Experimental |
| 39 | kscanne/chichewa | NLP resources for Chichewa | | Experimental |
| 40 | d0rj/RusLit | 📚 A small collection of Russian literature 📚 | | Experimental |
| 41 | kateryna-bobrovnyk/ukr-twi-corpus | A corpus of Ukrainian Twitter texts + instructions for downloading and... | | Experimental |
| 42 | Kartikaggarwal98/Indian_ParallelCorpus | Curated list of publicly available parallel corpora for Indian languages | | Experimental |
| 43 | DFKI-NLP/product-corpus | This repository contains the DFKI Product Corpus, a dataset of 174 documents... | | Experimental |
| 44 | AlexKly/Detailed-NER-Dataset-RU | Labeled Russian text token-by-token for training models for NER task based... | | Experimental |
| 45 | NLP-UMUTeam/Spanish-MisoCorpus-2020 | Spanish MisoCorpus 2020 | | Experimental |
| 46 | gambolputty/textstelle | Textstelle is a collection of corpora for the creation of bots and other... | | Experimental |
| 47 | Digital-Pushkin-Lab/RuAdapt | A Parallel Russian-Simple Russian Dataset | | Experimental |
| 48 | SaiedAlshahrani/performance-implications | Performance Implications of Using Unrepresentative Corpora in Arabic Natural... | | Experimental |
| 49 | karen-pal/borges | Datasets of short-story texts by various Latin American authors... | | Experimental |
| 50 | DOLMA-NLP/PARME | Parallel corpora for Middle Eastern languages - ACL2025 | | Experimental |
| 51 | AsoSoft/AsoSoft-Text-Corpus | AsoSoft Text Corpus is the first large-scale text corpus for the Kurdish language. | | Experimental |
| 52 | derintelligence/en-az-parallel-corpus | English-Azerbaijani parallel language corpus | | Experimental |
| 53 | josealzugaray/cayetana-corpus | NLP analysis of 107 political speech transcripts (TEI-XML corpus) - topic... | | Experimental |
| 54 | mannefedov/ru_kw_eval_datasets | Datasets for evaluation of keyword extraction in Russian | | Experimental |
| 55 | steventan0110/align-filter | Repository for "Bitext Mining for Low-Resource Languages via Contrastive Learning" | | Experimental |
| 56 | KurdishBLARK/KTC | Kurdish Textbooks Corpus | | Experimental |
| 57 | madrugado/gia-corpus | Corpus of exam tests for 9th-graders in Russia for NLP/ML purposes | | Experimental |
| 58 | mideind/GreynirCorpus | A large treebank of parsed Icelandic text | | Experimental |
| 59 | ixa-ehu/cometa | Website of CoMeta, a Corpus for Metaphor Detection in Spanish | | Experimental |
| 60 | CoffeBank/Ru-hard-detection-dataset | Ru AI-text detection dataset / Russian-language dataset for evaluating the detection of... | | Experimental |
| 61 | Ofis-publik-ar-brezhoneg/breton-french-corpus | Bilingual Breton-French corpus | | Experimental |
| 62 | lirondos/coalas | COrpus of AngLicisms in the SpAnish PresS (COALAS) 🐨 | | Experimental |
| 63 | SaiedAlshahrani/Wikipedia-Corpora-Report | Wikipedia Corpora Meta Report: A Metadata Report of How Wikipedia Editions... | | Experimental |
| 64 | Tentakl3/Spanish-NLP-preprocessing | Customized tokenization and preprocessing of Natural Language in Spanish -... | | Experimental |
| 65 | TianciGao/RussScholar-Seeker | RussScholar-Seeker: A Python package for predicting whether a name is Russian... | | Experimental |
| 66 | CesarJNP/Depression-corpus-spanish | Spanish depression-labeled corpus (0/1) | | Experimental |
| 67 | techiaith/corpws-meincnodi-rhannau-ymadrodd | A corpus for benchmarking Welsh part-of-speech taggers | | Experimental |
| 68 | peghaz/corpora-intersector | High-performance tool to find corpus words missing from large texts using... | | Experimental |
| 69 | Uyghur-Corpus/Uyghur-Corpus | Large-scale Uyghur corpus optimized for Large Language Models (LLMs) and NLP... | | Experimental |
| 70 | DusunDictionary/dusun-english-malay-corpus | A linguistic corpus of the Dusun language for NMT and LLM training | | Experimental |
| 71 | BrightXiaoHan/Yitextor | Parallel corpus processing toolkit forked from... | | Experimental |
| 72 | mosesab/Corpus-based-synonym-finder | Finds the synonym of words in a language using a language corpus | | Experimental |
| 73 | progmatix21/Chilka | A corpus server library with a document database backend. | | Experimental |
| 74 | hassanzadehmahdi/BioPersianWikiAnalyzer | Persian Wikipedia Bioinformatics Page Crawler and Text Preprocessor | | Experimental |