shjwudp/c4-dataset-script

Inspired by Google's C4 (Colossal Clean Crawled Corpus), this is a series of data cleaning scripts focused on processing CommonCrawl data. It includes Chinese data processing and the cleaning methods from MassiveText.

/ 100

Emerging

135 stars. No commits in the last 6 months.

Stale 6m No Package No Dependents

Maintenance 0 / 25

Adoption 10 / 25

Maturity 16 / 25

Community 15 / 25

Stars

135

Forks

Language

Python

License

MIT

Category

nlp-dataset-collections

Last pushed

Jun 07, 2023

Commits (30d)

NLP Dataset Collections · 93 tools

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/nlp/shjwudp/c4-dataset-script"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
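
The same request can be made from Python. The following is a minimal sketch using only the standard library. It assumes the URL follows the pattern visible in the curl command above (base path, then a domain segment such as `nlp`, then owner and repository name). The helper names `build_quality_url` and `fetch_quality` are hypothetical, and because the response format is not documented here, the body is returned as raw text rather than parsed.

```python
import urllib.request

# Base path taken from the curl example; the segment layout after it
# (domain/owner/repo) is an assumption inferred from that single URL.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(domain: str, owner: str, repo: str) -> str:
    """Build the quality-score URL for one repository (hypothetical helper)."""
    return f"{BASE_URL}/{domain}/{owner}/{repo}"


def fetch_quality(domain: str, owner: str, repo: str, timeout: float = 10.0) -> str:
    """Fetch the raw response body as text.

    The response schema is not described on this page, so no parsing is
    attempted; inspect the returned text before relying on any field.
    """
    url = build_quality_url(domain, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Example call (performs a network request, so it is left commented out):
# print(fetch_quality("nlp", "shjwudp", "c4-dataset-script"))
```

With the keyless limit of 100 requests/day, it is probably worth caching the returned text locally rather than calling `fetch_quality` repeatedly for the same repository.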

Higher-rated alternatives

acl-org/acl-anthology

Data and software for building the ACL Anthology.

anoopkunchukuttan/indic_nlp_library

Resources and tools for Indian language Natural Language Processing

CLUEbenchmark/CLUECorpus2020

Large-scale Pre-training Corpus for Chinese 100G 中文预训练语料

SudhirGadhvi/open-vernacular-ai-kit

Clean Indian code-mixed text before it reaches your LLM.

Separius/awesome-sentence-embedding

A curated list of pretrained sentence and word embedding models
