CLUEbenchmark/CLUECorpus2020
Large-scale Pre-training Corpus for Chinese: 100 GB Chinese pre-training corpus
Extracted from Common Crawl and cleaned to a high quality standard, the corpus includes a specialized simplified-Chinese vocabulary (8,021 tokens) optimized for NLP tasks, reducing token overhead compared to Google's multilingual vocabulary. The data is pre-formatted for direct use in BERT and other masked-language-model training, and experiments demonstrate competitive performance on CLUE benchmark tasks with equal or smaller data volumes.
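A minimal sketch of how the vocabulary could be loaded for tokenization, assuming it ships as a plain one-token-per-line file (the name vocab_clue.txt is hypothetical) and using Hugging Face's BertTokenizer:

import requests  # not needed here; see the API example below
from transformers import BertTokenizer

# Load the corpus's 8,021-token simplified-Chinese vocabulary into a
# standard BERT tokenizer. Point vocab_file at the file shipped with
# the corpus; "vocab_clue.txt" is an assumed name.
tokenizer = BertTokenizer(vocab_file="vocab_clue.txt", do_lower_case=True)

print(len(tokenizer))  # vocabulary size; should be around 8,021

# Tokenize a sample sentence into WordPiece ids for masked-LM pre-training.
ids = tokenizer.encode("中文预训练语料可以直接用于掩码语言模型训练。")
print(ids)

Keeping the vocabulary this small shrinks the embedding matrix and, per the description above, reduces token overhead relative to Google's multilingual vocabulary.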
Stars: 1,002
Forks: 83
Language: —
License: MIT
Category: —
Last pushed: Feb 06, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/CLUEbenchmark/CLUECorpus2020"
Open to everyone: 100 requests per day with no key required. A free key raises the limit to 1,000 per day.
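For scripted access, a minimal Python sketch of the same request (the X-API-Key header name for keyed access is an assumption; check the API docs for the real scheme):

import requests

# Fetch this repository's record from the quality API.
url = "https://pt-edge.onrender.com/api/v1/quality/nlp/CLUEbenchmark/CLUECorpus2020"
resp = requests.get(url)  # anonymous tier: 100 requests/day
# With a key (assumed header name):
#   requests.get(url, headers={"X-API-Key": "..."})
resp.raise_for_status()
print(resp.json())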
Related tools
acl-org/acl-anthology
Data and software for building the ACL Anthology.
anoopkunchukuttan/indic_nlp_library
Resources and tools for Indian language Natural Language Processing
SudhirGadhvi/open-vernacular-ai-kit
Clean Indian code-mixed text before it reaches your LLM.
Separius/awesome-sentence-embedding
A curated list of pretrained sentence and word embedding models
KennethEnevoldsen/scandinavian-embedding-benchmark
A Scandinavian Benchmark for sentence embeddings