twang2218/vocab-coverage
Analysis of Chinese cognitive capability in language models
Analyzes Chinese character recognition across language models through three complementary dimensions: character coverage rates against standard dictionaries (8,105–20,992 characters), tokenizer behavior patterns (WordPiece vs. byte-level BPE), and embedding space distributions via t-SNE visualization. The toolkit provides CLI commands (`charset`, `coverage`, `embedding`) to inspect vocabulary composition, measure character recognition depth (distinguishing single-token vs. fragmented representations), and assess whether token embeddings retain semantic structure or remain in random initialization. Comparative analysis spans BERT variants, ERNIE, multilingual models, LLaMA derivatives, and commercial APIs, revealing how tokenization strategies and training data affect Chinese language understanding.
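The coverage measurement described above can be sketched in a few lines. This is an illustrative toy, not the toolkit's actual API: the `coverage` helper and the stand-in tokenizer are assumptions, showing only the idea of distinguishing single-token characters from fragmented (byte-level) ones.

```python
def coverage(charset, encode):
    """Fraction of `charset` that the tokenizer encodes as a single token
    (as opposed to fragmenting into multiple byte-level tokens)."""
    single = sum(1 for ch in charset if len(encode(ch)) == 1)
    return single / len(charset)

# Toy stand-in for a tokenizer: characters in the vocabulary map to one
# token; unknown characters fall back to their UTF-8 bytes (byte-level BPE
# style), so each becomes three tokens.
vocab = set("中文模型")
encode = lambda ch: [ch] if ch in vocab else list(ch.encode("utf-8"))

print(coverage("中文模型分析", encode))  # 4 of 6 characters are single-token
```

With a real model one would substitute a tokenizer's own encode function and one of the standard character sets (e.g. the 8,105-character 通用规范汉字表) for the toy inputs.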
236 stars and 42 monthly downloads. No commits in the last 6 months. Available on PyPI.
Stars: 236
Forks: 24
Language: Python
License: Apache-2.0
Last pushed: Sep 09, 2023
Monthly downloads: 42
Commits (30d): 0
Dependencies: 17
Get this data via API:

```shell
curl "https://pt-edge.onrender.com/api/v1/quality/embeddings/twang2218/vocab-coverage"
```

Open to everyone: 100 requests/day with no key; a free key raises the limit to 1,000/day.
Related tools
MinishLab/model2vec
Fast State-of-the-Art Static Embeddings
AnswerDotAI/ModernBERT
Bringing BERT into modernity via both architecture changes and scaling
Santosh-Gupta/SpeedTorch
Library for faster pinned CPU <-> GPU transfer in PyTorch
Embedding/Chinese-Word-Vectors
100+ Chinese Word Vectors: hundreds of pre-trained Chinese word vectors
tensorflow/hub
A library for transfer learning by reusing parts of TensorFlow models.