jiangnanboy/llm_corpus_quality
Chinese corpus cleaning and quality evaluation for large-model pre-training
28 / 100
Experimental
No commits in the last 6 months.
No License
Stale 6m
No Package
No Dependents
Maintenance: 0 / 25
Adoption: 9 / 25
Maturity: 8 / 25
Community: 11 / 25
Stars: 76
Forks: 7
Language: Java
License: None
Category:
Last pushed: Jul 25, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/jiangnanboy/llm_corpus_quality"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000 requests/day.
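The same request can be made from a script. The following is a minimal Python sketch that uses only the standard library. It assumes the URL pattern is base path, then category, then owner/repo, which is inferred from the single curl example above and is not confirmed elsewhere on this page. The function names build_quality_url and fetch_quality are hypothetical, and the response format is not documented here, so the body is returned as raw text rather than parsed.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Base path copied from the curl example; everything after it is an
# assumed pattern (category, then owner/repo).
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category: str, repo: str) -> str:
    """Build the endpoint URL for a repo given as 'owner/name' (hypothetical helper)."""
    owner, name = repo.split("/", 1)
    return f"{BASE_URL}/{quote(category)}/{quote(owner)}/{quote(name)}"


def fetch_quality(category: str, repo: str, timeout: float = 10.0) -> str:
    """Fetch the raw response body as text.

    The response schema is not described on this page, so no parsing
    is attempted here.
    """
    with urlopen(build_quality_url(category, repo), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Calling fetch_quality("nlp", "jiangnanboy/llm_corpus_quality") issues the same GET request as the curl command. Unauthenticated use counts against the 100 requests/day limit, so results are worth caching if the script runs repeatedly.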
Higher-rated alternatives
mikahama/uralicNLP (score: 57)
An NLP library for Uralic languages such as Finnish, Skolt Sami, Moksha and so on. Also...
SkyworkAI/Skywork (score: 46)
Skywork series models are pre-trained on 3.2TB of high-quality multilingual (mainly Chinese and...
shamspias/lexsublm-lite (score: 36)
A laptop‑friendly toolkit for context‑aware single‑word paraphrasing and lexical‑substitution...