Chinese Text Corpora NLP Tools

Large-scale Chinese-language datasets and text collections organized by domain (literature, social media, news, etc.), including lexicons, word lists, and annotated datasets. Does NOT include tools for processing corpora, embedding training, or non-Chinese language resources.

There are 48 Chinese text corpora tools tracked. 6 score above 50 (Established tier). The highest-rated is esbatmop/MNBVC at 60/100 with 4,144 stars. 3 of the top 10 are actively maintained.

Get all 48 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=nlp&subcategory=chinese-text-corpora&limit=20"

Open to everyone: 100 requests/day, no key needed. A free key raises the limit to 1,000/day.
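The same request can be issued from a script. Below is a minimal Python sketch using only the standard library. The endpoint and query parameters are copied from the curl command above; the names `build_url` and `fetch_projects` are illustrative, not part of any published client. The response schema is not documented on this page, so the sketch simply returns the parsed JSON as-is. Note that the curl command passes `limit=20`; whether a larger value returns all 48 projects in one call is an assumption to verify against the API.

```python
import json
import urllib.request
from urllib.parse import urlencode

# Endpoint taken from the curl command on this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="nlp", subcategory="chinese-text-corpora", limit=20):
    """Build the request URL with the same query parameters as the curl example."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_projects(url, timeout=10.0):
    """Fetch the URL and return the decoded JSON payload.

    The payload structure is an unknown here; inspect it before relying
    on any particular field names.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_projects(build_url())` performs the network request. With no key, the limit stated above is 100 requests/day, so it is worth caching the result locally rather than re-fetching on every run.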

# Tool Score Tier
1 esbatmop/MNBVC

MNBVC(Massive Never-ending BT Vast Chinese...

60
Established
2 brightmart/nlp_chinese_corpus

Large Scale Chinese Corpus for NLP

59
Established
3 houbb/sensitive-word

👮‍♂️ The sensitive word tool for Java. (Sensitive words / banned words / illegal words / profanity. A high-performance, DFA-based Java...

57
Established
4 NateScarlet/holiday-cn

📅🇨🇳 China statutory public holiday data, scraped automatically every day from State Council announcements

55
Established
5 sagorbrur/bnlp

BNLP is a natural language processing toolkit for Bengali Language.

52
Established
6 crownpku/Awesome-Chinese-NLP

A curated list of resources for Chinese NLP

51
Established
7 thunlp/THUOCL

THUOCL (THU Open Chinese Lexicon), a Chinese lexicon

44
Emerging
8 Kabir5296/banglanlptoolkit

Bangla NLP toolkit: Bangla text normalization, punctuation generation and...

42
Emerging
9 hiDaDeng/Chinese-Pretrained-Word-Embeddings

A roundup of resources on Chinese text analysis tools, corpora, and pretrained models.

41
Emerging
10 BengaliAI/bengaliAnalyzer

This module helps to analyze Bengali sentences. It can analyze various...

41
Emerging
11 shjwudp/shu

Collection of Chinese Books

40
Emerging
12 sheepzh/poetry

The most complete corpus of modern Chinese-language poetry on Earth: 3k+ poets, 80K+ poems, 15M+ characters

40
Emerging
13 CLUEbenchmark/CLUEDatasetSearch

Search all Chinese NLP datasets, with commonly used English NLP datasets included

40
Emerging
14 yuanjie-ai/ChineseSensitiveVocabulary

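Violence, terrorism, and prohibited content; pornographic text; politically sensitive terms; malicious promotion; vulgarity and abuse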

38
Emerging
15 Foysal87/Bangla-NLP-Dataset

Bangla NLP dataset. Bangla NER,POStag, text summarization, stopword,...

38
Emerging
16 cjymz886/find-Chinese-medical-words

New word discovery, unsupervised lexicon generation, medical lexicon generation, out-of-vocabulary word discovery

37
Emerging
17 CanCLID/canto-filter

Cantonese text filter

36
Emerging
18 renfei/dict

Chinese lexicon/dictionary for use in NLP projects, word segmentation, and similar scenarios

36
Emerging
19 nonamestreet/weixin_public_corpus

WeChat official account corpus

36
Emerging
20 guotong1988/chinese_dictionary

Synonym list, antonym list, negation word list

36
Emerging
21 osyvokon/awesome-ukrainian-nlp

Curated list of Ukrainian natural language processing (NLP) resources...

36
Emerging
22 secsilm/zi-dataset

Chinese character dataset with per-character information such as stroke count, radical, pinyin, and English definitions/synonyms.

35
Emerging
23 GanjinZero/awesome_Chinese_medical_NLP

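A compilation of public Chinese medical NLP resources: terminologies/corpora/word embeddings/pretrained models/knowledge graphs/named entity recognition/QA/information extraction/models/papers/etc.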

34
Emerging
24 SongRb/DeepDiveChineseApps

DeepDive Tutorial with Chinese Support

34
Emerging
25 didi/ChineseNLP

Datasets and SOTA results for every field of Chinese NLP

33
Emerging
26 howl-anderson/MITIE_Chinese_Wikipedia_corpus

Pre-trained Wikipedia corpus by MITIE

33
Emerging
27 zispace/hanzi-words

List of commonly used Chinese words

33
Emerging
28 hantang/data-corpus

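Collection of corpus data and lexicons: Chinese and English stop words, sentiment analysis, categorized dictionaries, sensitive word lists (banned words, censored words). stop words, sentiment analysis,...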

33
Emerging
29 xtea/chinese_medical_words

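Hand-curated corpus of medical industry vocabulary and terminology. Can be used to train various NLP models, such as those for speech recognition and dialogue systems.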

32
Emerging
30 guhhhhaa/4675-scifi

Chinese NLP corpus of Chinese science fiction, Chinese science fiction corpus...

32
Emerging
31 guhhhhaa/wula-scifi

Chinese NLP corpus of Chinese science fiction, Chinese science fiction...

31
Emerging
32 ko-nlp/Open-korean-corpora

Open Korean NLP Dataset Curation for the Users All Around the Globe

31
Emerging
33 modernmt/profanity-filter

Simple and fast dictionary-based multi-language profanity filter written in Java

29
Experimental
34 Tanat05/korean-profanity-resources

A resource repository collecting datasets, libraries, and APIs related to Korean profanity, vulgar language, and hate speech (offensive language)

27
Experimental
35 KehaoWu/Jinyong-Corpus

Dictionary of Jin Yong's 15 novels

27
Experimental
36 kevinhu/hotpot

A lightweight Chinese-English dictionary

25
Experimental
37 jilelab/social-media-chinese-words

Chinese lexicon for social media. Covers proper nouns and neologisms specific to the social media domain.

24
Experimental
38 readme-SVG/Banned-words

🤬🗯️ Multilingual profanity & banned word lists with a browser-based editor...

23
Experimental
39 Koukotsukan/Weibo-Trending-Names-Corpus

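Names that appeared in Weibo trending searches between 2020-11-24 and 2022-07-10 (mainly stars, politicians, celebrities, internet influencers, entrepreneurs, etc.)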

23
Experimental
40 JiangYanting/Chinese_book_dataset

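Chinese book dataset/data mining/natural language processing/Chinese Library Classification/library and information science/data mining/text classification/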

22
Experimental
41 SunsetMkt/netease-sensitive-words

Offline sensitive word list from NetEase Games (regex)

19
Experimental
42 JasonShao55/Chinese_Metaphor_Explanation

An annotated Chinese metaphor dataset

15
Experimental
43 alexeyev/awesome-azerbaijani-nlp

Azerbaijani language processing software, models and datasets.

14
Experimental
44 direct-phonology/jdsw

Parsing the "Jingdian Shiwen" with spaCy

12
Experimental
45 Koukotsukan/Chinese-and-Korean-Star-Name-in-Chinese-Corpus

Corpus of Chinese-language names of Chinese and Korean stars

12
Experimental
46 JiangYanting/Word_list_dataset_terminology

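Terminology dictionary dataset/word segmentation dictionary/specialized word list corpus/lexical knowledge base/domain word list download/thesaurus/lexicon/natural language processing/data mining/deep learning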

12
Experimental
47 bangla-toolkit/core

A set of tools written in TypeScript to work with the Bangla language.

11
Experimental
48 nurulhudaapon/bangla.js

A set of tools written in TypeScript to work with the Bangla language.

11
Experimental
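
The tier labels in the listing above appear to follow score bands. Here is a minimal sketch of that mapping, written as a hypothetical helper `tier_for`. Only the "above 50" cutoff for Established is stated on this page. The Emerging/Experimental boundary is an assumption inferred from the rows: the lowest Emerging score listed is 31 and the highest Experimental score is 29, so a score of exactly 30 is unobserved and is placed in Emerging here arbitrarily.

```python
def tier_for(score: int) -> str:
    """Map a 0-100 quality score to a tier label.

    "Above 50 is Established" comes from the page text. The cutoff at 30
    is an assumption inferred from the listed rows, not a documented rule.
    """
    if score > 50:
        return "Established"
    if score >= 30:  # assumed boundary; 30 itself does not occur in the listing
        return "Emerging"
    return "Experimental"


# A few (project, score) pairs copied from the listing.
sample = [
    ("esbatmop/MNBVC", 60),
    ("thunlp/THUOCL", 44),
    ("ko-nlp/Open-korean-corpora", 31),
    ("modernmt/profanity-filter", 29),
]

for name, score in sample:
    print(f"{name}: {score} -> {tier_for(score)}")
```

If the actual thresholds are published elsewhere, they should replace the assumed cutoff.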