esbatmop/MNBVC

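MNBVC (Massive Never-ending BT Vast Chinese corpus) is a very large-scale Chinese corpus, benchmarked against the 40T of data used to train ChatGPT. The MNBVC dataset covers not only mainstream culture but also niche subcultures, and even text written in "Martian script" (火星文). It includes plain-text Chinese data in every form: news, essays, novels, books, magazines, papers, scripts and dialogue, forum posts, wiki, classical poetry, lyrics, product descriptions, jokes, embarrassing anecdotes, chat logs, and more.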

60 / 100 (Established)

4,144 stars. Actively maintained with 2 commits in the last 30 days.

No Package · No Dependents
Maintenance 16 / 25
Adoption 10 / 25
Maturity 16 / 25
Community 18 / 25


Stars: 4,144
Forks: 287
Language: (none listed)
License: MIT
Last pushed: Mar 08, 2026
Commits (30d): 2

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/nlp/esbatmop/MNBVC"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
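
For scripted access, the same request can be sketched in Python using only the standard library. This is a minimal sketch under stated assumptions, not documented client code: the helper names `build_quality_url` and `fetch_quality` are hypothetical, the path layout `/quality/<category>/<owner>/<repo>` is inferred from the single curl example above, and the response is assumed to be JSON (its fields are not described here, so the sketch does not rely on any of them).

```python
import json
import urllib.request
from urllib.parse import quote

# Base URL taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL for one repository.

    Hypothetical helper: the category/owner/repo path layout is an
    assumption inferred from the one example URL on this page.
    """
    # Percent-encode each segment so unusual names cannot alter the path.
    parts = [quote(part, safe="") for part in (category, owner, repo)]
    return f"{BASE_URL}/{'/'.join(parts)}"


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    """Fetch the quality record for a repository.

    Assumes the endpoint returns a JSON body; raises urllib.error.HTTPError
    on a non-2xx status (for example, if the daily request limit is hit).
    """
    url = build_quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `fetch_quality("nlp", "esbatmop", "MNBVC")` would request the same URL as the curl command. With the keyless limit of 100 requests/day, it is worth caching the result rather than fetching on every run.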