Chinese Text Corpora NLP Tools
Large-scale Chinese language datasets and text collections organized by domain (literature, social media, news, etc.). Includes lexicons, word lists, and annotated datasets. Does NOT include tools for processing corpora, embedding training, or non-Chinese language resources.
There are 48 Chinese text corpora tools tracked. Six score above 50 (Established tier). The highest-rated is esbatmop/MNBVC at 60/100 with 4,144 stars. Three of the top 10 are actively maintained.
Get all 48 projects as JSON
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=nlp&subcategory=chinese-text-corpora&limit=20"
Open to everyone: 100 requests/day, no key needed. Get a free key for 1,000 requests/day.
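For scripted use, the same request can be made from Python. The following is a minimal sketch using only the standard library; it reuses the URL and query parameters from the curl command as-is. The helper names `build_url` and `fetch_listing` are hypothetical, and the shape of the JSON response is not documented here, so the code decodes it without assuming any field names. Whether a larger `limit` returns all 48 projects in one call is also an assumption to verify against the API.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint taken from the curl command above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="nlp", subcategory="chinese-text-corpora", limit=20):
    """Build the listing URL with the same query parameters as the curl call."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_listing(url, timeout=30):
    """GET the URL and decode the body as JSON.

    No response schema is assumed: the decoded object is returned unchanged.
    """
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example (performs a network request, so it is left commented out):
# data = fetch_listing(build_url())
# print(json.dumps(data, ensure_ascii=False, indent=2))
```

Passing `ensure_ascii=False` when printing keeps Chinese descriptions readable instead of escaping them as `\uXXXX` sequences.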
| # | Tool | Description | Score | Tier |
|---|---|---|---|---|
| 1 | esbatmop/MNBVC | MNBVC(Massive Never-ending BT Vast Chinese... | | Established |
| 2 | brightmart/nlp_chinese_corpus | Large Scale Chinese Corpus for NLP | | Established |
| 3 | houbb/sensitive-word | 👮‍♂️ The sensitive word tool for Java (sensitive words, banned words, illegal words, profanity). A DFA-based high-performance Java... | | Established |
| 4 | NateScarlet/holiday-cn | 📅🇨🇳 Chinese statutory holiday data, scraped automatically every day from State Council announcements | | Established |
| 5 | sagorbrur/bnlp | BNLP is a natural language processing toolkit for Bengali Language. | | Established |
| 6 | crownpku/Awesome-Chinese-NLP | A curated list of resources for Chinese NLP | | Established |
| 7 | thunlp/THUOCL | THUOCL (THU Open Chinese Lexicon), a Chinese lexicon | | Emerging |
| 8 | Kabir5296/banglanlptoolkit | Bangla NLP toolkit: Bangla text normalization, punctuation generation and... | | Emerging |
| 9 | hiDaDeng/Chinese-Pretrained-Word-Embeddings | A roundup of Chinese text analysis tools, corpora, and pretrained model resources. | | Emerging |
| 10 | BengaliAI/bengaliAnalyzer | This module helps to analyze Bengali sentences. It can analyze various... | | Emerging |
| 11 | shjwudp/shu | Collection of Chinese Books | | Emerging |
| 12 | sheepzh/poetry | The most complete corpus of modern Chinese-language poetry on Earth: 3k+ poets, 80K+ poems, 15M+ characters | | Emerging |
| 13 | CLUEbenchmark/CLUEDatasetSearch | Search all Chinese NLP datasets, plus commonly used English NLP datasets | | Emerging |
| 14 | yuanjie-ai/ChineseSensitiveVocabulary | Violence, terrorism, and contraband; pornographic text; politically sensitive content; malicious promotion; vulgar abuse | | Emerging |
| 15 | Foysal87/Bangla-NLP-Dataset | Bangla NLP dataset. Bangla NER, POStag, text summarization, stopword,... | | Emerging |
| 16 | cjymz886/find-Chinese-medical-words | New word discovery, unsupervised lexicon generation, medical lexicon generation, out-of-vocabulary word discovery | | Emerging |
| 17 | CanCLID/canto-filter | Cantonese text filter | | Emerging |
| 18 | renfei/dict | Chinese lexicon/dictionary for NLP projects, word segmentation, and similar uses | | Emerging |
| 19 | nonamestreet/weixin_public_corpus | Corpus of WeChat official accounts | | Emerging |
| 20 | guotong1988/chinese_dictionary | Synonym list, antonym list, negation word list | | Emerging |
| 21 | osyvokon/awesome-ukrainian-nlp | Curated list of Ukrainian natural language processing (NLP) resources... | | Emerging |
| 22 | secsilm/zi-dataset | Chinese character dataset with per-character information such as stroke count, radical, pinyin, and English definitions/synonyms. | | Emerging |
| 23 | GanjinZero/awesome_Chinese_medical_NLP | Collection of public Chinese medical NLP resources: terminologies/corpora/word vectors/pretrained models/knowledge graphs/named entity recognition/QA/information extraction/models/papers/etc | | Emerging |
| 24 | SongRb/DeepDiveChineseApps | DeepDive Tutorial with Chinese Support | | Emerging |
| 25 | didi/ChineseNLP | Datasets and SOTA results for every field of Chinese NLP | | Emerging |
| 26 | howl-anderson/MITIE_Chinese_Wikipedia_corpus | Pre-trained Wikipedia corpus by MITIE | | Emerging |
| 27 | zispace/hanzi-words | List of common Chinese words | | Emerging |
| 28 | hantang/data-corpus | Collection of corpus data and lexicons: Chinese and English stop words, sentiment analysis, categorized dictionaries, sensitive word lists (banned words, censored words). stop words, sentiment analysis,... | | Emerging |
| 29 | xtea/chinese_medical_words | Hand-curated corpus of medical industry vocabulary and terminology. Can be used to train NLP models for speech recognition, dialogue systems, and more. | | Emerging |
| 30 | guhhhhaa/4675-scifi | Chinese NLP corpus of Chinese science fiction, Chinese science fiction corpus... | | Emerging |
| 31 | guhhhhaa/wula-scifi | Chinese NLP corpus of Chinese science fiction, Chinese science fiction... | | Emerging |
| 32 | ko-nlp/Open-korean-corpora | Open Korean NLP Dataset Curation for the Users All Around the Globe | | Emerging |
| 33 | modernmt/profanity-filter | Simple and fast dictionary-based multi-language profanity filter written in Java | | Experimental |
| 34 | Tanat05/korean-profanity-resources | Resource repository collecting datasets, libraries, and APIs related to Korean profanity, vulgar language, and hate speech (offensive language) | | Experimental |
| 35 | KehaoWu/Jinyong-Corpus | Dictionary of Jin Yong's 15 novels | | Experimental |
| 36 | kevinhu/hotpot | A lightweight Chinese-English dictionary | | Experimental |
| 37 | jilelab/social-media-chinese-words | Chinese social media lexicon covering proper nouns and new words specific to social media. | | Experimental |
| 38 | readme-SVG/Banned-words | 🤬🗯️ Multilingual profanity & banned word lists with a browser-based editor... | | Experimental |
| 39 | Koukotsukan/Weibo-Trending-Names-Corpus | Names that appeared in Weibo trending searches between 2020-11-24 and 2022-07-10 (mainly entertainers, politicians, celebrities, influencers, entrepreneurs, etc.) | | Experimental |
| 40 | JiangYanting/Chinese_book_dataset | Chinese book dataset/data mining/natural language processing/Chinese Library Classification/library and information science/data mining/text classification | | Experimental |
| 41 | SunsetMkt/netease-sensitive-words | Offline sensitive word list from NetEase Games (regex) | | Experimental |
| 42 | JasonShao55/Chinese_Metaphor_Explanation | An annotated Chinese metaphor dataset | | Experimental |
| 43 | alexeyev/awesome-azerbaijani-nlp | Azerbaijani language processing software, models and datasets. | | Experimental |
| 44 | direct-phonology/jdsw | Parsing the "Jingdian Shiwen" with spaCy | | Experimental |
| 45 | Koukotsukan/Chinese-and-Korean-Star-Name-in-Chinese-Corpus | Corpus of the Chinese-language names of Chinese and Korean celebrities | | Experimental |
| 46 | JiangYanting/Word_list_dataset_terminology | Terminology dictionary dataset/word segmentation dictionary/specialized word list corpus/lexical knowledge base/domain word list download/thesaurus/lexicon/natural language processing/data mining/deep learning | | Experimental |
| 47 | bangla-toolkit/core | A set of tools written in TypeScript to work with the Bangla language. | | Experimental |
| 48 | nurulhudaapon/bangla.js | A set of tools written in TypeScript to work with the Bangla language. | | Experimental |