twang2218/vocab-coverage

Chinese cognitive capability analysis for language models (语言模型中文认知能力分析)

Score: 54 / 100 (Established)

Analyzes Chinese character recognition across language models through three complementary dimensions: character coverage rates against standard dictionaries (8,105–20,992 characters), tokenizer behavior patterns (WordPiece vs. byte-level BPE), and embedding space distributions via t-SNE visualization. The toolkit provides CLI commands (`charset`, `coverage`, `embedding`) to inspect vocabulary composition, measure character recognition depth (distinguishing single-token vs. fragmented representations), and assess whether token embeddings retain semantic structure or remain in random initialization. Comparative analysis spans BERT variants, ERNIE, multilingual models, LLaMA derivatives, and commercial APIs, revealing how tokenization strategies and training data affect Chinese language understanding.
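The coverage and recognition-depth ideas in the description can be sketched as a small function: given a tokenizer's encoding step and a reference character set, count how many characters survive as a single token versus being fragmented into multiple pieces. The `toy_vocab` and `toy_encode` helpers below are illustrative stand-ins, not the project's actual API.

```python
def coverage_report(chars, encode):
    """Classify each character by how the tokenizer represents it.

    chars:  iterable of single characters (e.g. a standard dictionary set)
    encode: function mapping a string to a list of tokens
    Returns (single_token, fragmented) counts.
    """
    single = fragmented = 0
    for ch in chars:
        tokens = encode(ch)
        if len(tokens) == 1:
            single += 1      # character has a dedicated vocabulary entry
        else:
            fragmented += 1  # character is split (e.g. into byte-level BPE pieces)
    return single, fragmented

# Illustrative stand-in for a real tokenizer's encode():
# characters in the toy vocab map to one token, others fall back to UTF-8 bytes.
toy_vocab = {"你", "好", "学"}

def toy_encode(text):
    return [text] if text in toy_vocab else list(text.encode("utf-8"))

single, fragmented = coverage_report(["你", "好", "学", "龘"], toy_encode)
print(single, fragmented)  # 3 characters covered as single tokens, 1 fragmented
```

The coverage rate against a standard set of N characters is then simply `single / N`; the toolkit applies the same idea with real tokenizers and the 8,105- to 20,992-character reference dictionaries.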

236 stars and 42 monthly downloads. No commits in the last 6 months. Available on PyPI.

Status: Stale (6 months)
Maintenance: 0 / 25
Adoption: 14 / 25
Maturity: 25 / 25
Community: 15 / 25


Stars: 236
Forks: 24
Language: Python
License: Apache-2.0
Last pushed: Sep 09, 2023
Monthly downloads: 42
Commits (30d): 0
Dependencies: 17

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/embeddings/twang2218/vocab-coverage"

Open to everyone: 100 requests/day with no key needed; a free key raises the limit to 1,000/day.