NLP Corpus Datasets NLP Tools

Curated collections, loaders, and databases of text corpora for NLP research and training. Includes corpus compilation tools, domain-specific annotated datasets, and corpus management systems. Does NOT include tools for corpus analysis, linguistic annotation frameworks, or applications built on top of corpora.

There are 74 NLP corpus dataset tools tracked. 3 score 50 or above (Established tier). The highest-rated is Helsinki-NLP/OpusFilter at 64/100 with 115 stars and 532 monthly downloads.

Get all 74 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=nlp&subcategory=nlp-corpus-datasets&limit=20"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
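The same request can be made from Python. The following is a minimal sketch using only the standard library; the helper names `build_url` and `fetch_projects` are hypothetical, and the response schema is not described on this page, so the code returns the decoded JSON untouched rather than assuming any field names. The `limit=20` default mirrors the curl command above; whether a larger value returns all 74 projects in one call is not confirmed here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl command on this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="nlp", subcategory="nlp-corpus-datasets", limit=20):
    """Build the query URL with the same parameters as the curl example."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_projects(url, timeout=30):
    """Send a GET request and return the decoded JSON payload as-is.

    No structure is assumed for the payload (list vs. object, field names),
    because the response schema is not documented on this page.
    """
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_projects(build_url())` performs the network request. Unauthenticated access is limited to 100 requests/day, so it may be worth caching the result locally instead of re-fetching it.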

#	Tool	Score	Tier	Stars	Language
1	Helsinki-NLP/OpusFilter OpusFilter - Parallel corpus processing toolkit	64	Established	115	Python
2	natasha/corus Links to Russian corpora + Python functions for loading and parsing	58	Established	310	Jupyter Notebook
3	SergeyShk/ruTS Library for extracting statistics from Russian-language texts.	50	Established	125	Python
4	natasha/nerus Large silver-standard Russian corpus with NER, morphology and syntax markup	48	Emerging	73	Python
5	darija-open-dataset/dataset Darija <-> English dataset	45	Emerging	363	—
6	omicsNLP/Auto-CORPus Auto-CORPus pipeline developed by a University of Nottingham and Imperial...	45	Emerging	22	HTML
7	texttechnologylab/UCE The Unified Corpus Explorer (UCE) for UIMA-annotated Corpora.	45	Emerging	7	Java
8	fido-ai/ua-datasets A collection of datasets for Ukrainian language	44	Emerging	56	Python
9	texttechnologylab/GerParCor German Parliamentary Corpus (GerParCor)	43	Emerging	30	Java
10	Koziev/NLP_Datasets My NLP datasets for Russian language	38	Emerging	386	C#
11	bureaucratic-labs/dostoevsky Sentiment analysis library for the Russian language	36	Emerging	320	Python
12	M4t1ss/parallel-corpora-tools Tools for filtering and cleaning parallel and monolingual corpora for...	34	Emerging	41	PHP
13	JuliaText/CorpusLoaders.jl A variety of loaders for various NLP corpora.	34	Emerging	32	Julia
14	notesjor/corpusexplorer2.0 Corpus linguistics has never been so easy...	33	Emerging	25	C#
15	ericleasemorgan/reader Distant Reader, a tool for using & understanding a corpus	31	Emerging	20	Shell
16	JonathanReeve/corpus-db A textual corpus database for the digital humanities.	31	Emerging	63	Jupyter Notebook
17	rashiedomar/somali-wikipedia-corpus Cleaned Somali Wikipedia corpus (~9,500 articles) for NLP, LLM training, and...	31	Emerging	5	—
18	adbar/German-NLP Curated list of open-access/open-source/off-the-shelf resources and tools...	30	Emerging	518	—
19	microsoft/Clandestino Repository for the Clandestino corpus	30	Emerging	10	—
20	josecannete/spanish-corpora Unannotated Spanish 3 Billion Words Corpora	30	Emerging	104	Python
21	yutkin/Lenta.Ru-News-Dataset Corpus of Russian news articles collected from Lenta.Ru	29	Experimental	145	Python
22	maxoodf/russian_news_corpus Russian mass media stemmed texts corpus / Corpus of lemmatized...	29	Experimental	93	—
23	t-systems-on-site-services-gmbh/german-wikipedia-text-corpus This is a German text corpus from Wikipedia. It is cleaned, preprocessed and...	29	Experimental	23	—
24	KurdishBLARK/InterdialectCorpus A parallel corpus of Sorani, Kurmanji and English	29	Experimental	15	—
25	practikpharma/PGxCorpus PGxCorpus, a manually annotated corpus, designed for the extraction of...	28	Experimental	8	Lua
26	ilinguistics/corpus_similarity Measure the similarity of text corpora for 74 languages	28	Experimental	14	Python
27	velkadamban/Tamil-Corpus This Tamil project aims to create a comprehensive and high-quality...	26	Experimental	5	Roff
28	madhav1k/OpenCorpus A multilingual compilation of open-source textual corpora across major &...	26	Experimental	4	—
29	ajithalbus/TamilCorpus Open Source Tamil Corpus of 58M words	26	Experimental	11	Shell
30	microsoft/BrevE-CLaro Repository for the BrevE and CLaro datasets.	24	Experimental	4	—
31	davide-ghidelli-business/OpenCorpus OpenCorpus is a collection of open-source textual corpora from various...	23	Experimental	1	—
32	somosnlp/corpus-es List of NLP corpora in Spanish ✨ #Somos600M: Help develop AI...	23	Experimental	25	Python
33	Digital-Pushkin-Lab/RuAdapt_Word_Lists Word alignments from Russian-Simple Russian parallel data	23	Experimental	6	—
34	SpydazWebAI-NLP/BasicCorpus2023 A basic corpus object, giving positional encoding/decoding. A fully...	23	Experimental	1	Visual Basic .NET
35	notesjor/CorpusExplorer.Terminal.Console Allows other programs/programming languages to access...	23	Experimental	7	C#
36	ilinguistics/earthLings Corpus-based language and dialect mapping	23	Experimental	7	—
37	stdlib-js/datasets-moby-dick The text of Moby Dick by Herman Melville.	22	Experimental	4	JavaScript
38	juwiragiye/ikirundi The Ikirundi Corpus Project aims to create a comprehensive collection of...	22	Experimental	1	Python
39	kscanne/chichewa NLP resources for Chichewa	21	Experimental	10	Makefile
40	d0rj/RusLit 📚 A small collection of Russian literature 📚	21	Experimental	13	—
41	kateryna-bobrovnyk/ukr-twi-corpus A corpus of Ukrainian Twitter texts + instructions for downloading and...	21	Experimental	15	Jupyter Notebook
42	Kartikaggarwal98/Indian_ParallelCorpus Curated list of publicly available parallel corpus for Indian Languages	20	Experimental	37	—
43	DFKI-NLP/product-corpus This repository contains the DFKI Product Corpus, a dataset of 174 documents...	20	Experimental	12	—
44	AlexKly/Detailed-NER-Dataset-RU Labeled Russian text token-by-token for training models for NER task based...	19	Experimental	10	Python
45	NLP-UMUTeam/Spanish-MisoCorpus-2020 Spanish MisoCorpus 2020	19	Experimental	—	—
46	gambolputty/textstelle Textstelle is a collection of corpora for the creation of bots and other...	18	Experimental	21	—
47	Digital-Pushkin-Lab/RuAdapt A Parallel Russian-Simple Russian Dataset	17	Experimental	15	—
48	SaiedAlshahrani/performance-implications Performance Implications of Using Unrepresentative Corpora in Arabic Natural...	16	Experimental	3	Jupyter Notebook
49	karen-pal/borges Datasets of short-story texts by various Latin American authors....	16	Experimental	16	Jupyter Notebook
50	DOLMA-NLP/PARME Parallel corpora for Middle Eastern languages - ACL2025	15	Experimental	8	Python
51	AsoSoft/AsoSoft-Text-Corpus AsoSoft Text Corpus is the first large scale text corpus for the Kurdish language.	15	Experimental	27	—
52	derintelligence/en-az-parallel-corpus English-Azerbaijani parallel language corpus	15	Experimental	20	—
53	josealzugaray/cayetana-corpus NLP analysis of 107 political speech transcripts (TEI-XML corpus) — topic...	14	Experimental	—	HTML
54	mannefedov/ru_kw_eval_datasets Datasets for evaluation of keyword extraction in Russian	14	Experimental	31	—
55	steventan0110/align-filter Repository for "Bitext Mining for Low-Resource Languages via Contrastive Learning"	13	Experimental	5	Python
56	KurdishBLARK/KTC Kurdish Textbooks Corpus	13	Experimental	8	—
57	madrugado/gia-corpus Corpus of exam tests for 9-graders in Russia for NLP/ML purposes	13	Experimental	8	—
58	mideind/GreynirCorpus A large treebank of parsed Icelandic text	13	Experimental	8	—
59	ixa-ehu/cometa Website of the CoMeta, a Corpus for Metaphor Detection in Spanish	12	Experimental	4	Python
60	CoffeBank/Ru-hard-detection-dataset Ru AI-text detection dataset / Russian-language dataset for evaluating detection of...	12	Experimental	1	—
61	Ofis-publik-ar-brezhoneg/breton-french-corpus Bilingual Breton-French corpus	12	Experimental	1	—
62	lirondos/coalas COrpus of AngLicisms in the SpAnish PresS (COALAS) 🐨	12	Experimental	4	—
63	SaiedAlshahrani/Wikipedia-Corpora-Report Wikipedia Corpora Meta Report: A Metadata Report of How Wikipedia Editions...	11	Experimental	2	Python
64	Tentakl3/Spanish-NLP-preprocessing Customized tokenization and preprocessing of Natural Language in Spanish -...	11	Experimental	2	Python
65	TianciGao/RussScholar-Seeker RussScholar-Seeker: A Python package for predicting whether a name is Russian...	11	Experimental	2	Python
66	CesarJNP/Depression-corpus-spanish Spanish depression-labeled corpus (0/1)	11	Experimental	—	Python
67	techiaith/corpws-meincnodi-rhannau-ymadrodd Corpus for benchmarking Welsh part-of-speech taggers | A corpus for...	11	Experimental	—	—
68	peghaz/corpora-intersector High-performance tool to find corpus words missing from large texts using...	11	Experimental	—	OCaml
69	Uyghur-Corpus/Uyghur-Corpus Large-scale Uyghur corpus optimized for Large Language Models (LLMs) and NLP...	11	Experimental	—	—
70	DusunDictionary/dusun-english-malay-corpus A linguistic corpus of the Dusun language for NMT and LLM training	11	Experimental	—	—
71	BrightXiaoHan/Yitextor Parallel corpus processing toolkit forked from...	11	Experimental	2	Python
72	mosesab/Corpus-based-synonym-finder Finds the synonym of words in a language using a language corpus	10	Experimental	1	Python
73	progmatix21/Chilka A corpus server library with a document database backend.	10	Experimental	1	Python
74	hassanzadehmahdi/BioPersianWikiAnalyzer Persian Wikipedia Bioinformatics Page Crawler and Text Preprocessor	10	Experimental	1	Jupyter Notebook