bheinzerling/bpemb
Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE)
Embeddings are trained on Wikipedia and exposed as gensim KeyedVectors, enabling direct similarity queries and vector lookups. Vocabulary sizes are configurable from 1k to 200k subword units, which controls segmentation granularity. The library uses SentencePiece for tokenization and supports both subword segmentation and embedding lookup through a single Python API with automatic model downloading.
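To make the vocabulary-size/granularity tradeoff concrete, here is a minimal, self-contained sketch of the Byte-Pair Encoding idea itself. This is an illustration of the technique, not the bpemb or SentencePiece API (bpemb ships pre-trained models and handles segmentation for you); the corpus, function names, and merge counts below are invented for the example. Fewer merges (a smaller vocabulary) yields finer-grained pieces; more merges yields coarser ones.

```python
# Illustrative BPE sketch (NOT the bpemb API): the number of merges
# stands in for vocabulary size and controls subword granularity.
from collections import Counter


def merge_word(word, pair):
    """Replace every adjacent occurrence of `pair` in `word` with the fused symbol."""
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            out.append(word[i] + word[i + 1])
            i += 2
        else:
            out.append(word[i])
            i += 1
    return tuple(out)


def learn_bpe(corpus, num_merges):
    """Learn `num_merges` BPE merge rules from a list of words."""
    # Start with each word as a tuple of single characters.
    vocab = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the new merge to every word in the working vocabulary.
        vocab = Counter({merge_word(w, best): f for w, f in vocab.items()})
    return merges


def segment(word, merges):
    """Segment a word by replaying the learned merges in order."""
    pieces = tuple(word)
    for pair in merges:
        pieces = merge_word(pieces, pair)
    return list(pieces)


corpus = ["low", "low", "lower", "lowest", "newer", "newest"]
print(segment("lowest", learn_bpe(corpus, 2)))  # ['low', 'e', 's', 't']
print(segment("lowest", learn_bpe(corpus, 8)))  # ['lowe', 'st']
```

The same effect appears in bpemb: a small vocabulary segments rare words into many short pieces, while a large one (toward 200k) keeps common words intact.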
1,221 stars. No commits in the last 6 months.
Stars
1,221
Forks
102
Language
Python
License
MIT
Last pushed
Oct 01, 2024
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/embeddings/bheinzerling/bpemb"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
embeddings-benchmark/mteb
MTEB: Massive Text Embedding Benchmark
yannvgn/laserembeddings
LASER multilingual sentence embeddings as a pip package
harmonydata/harmony
The Harmony Python library: a research tool for psychologists to harmonise data and...
embeddings-benchmark/results
Data for the MTEB leaderboard
MilaNLProc/honest
A Python package to compute HONEST, a score to measure hurtful sentence completions in language...