bheinzerling/bpemb
Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE)
Embeddings are trained on Wikipedia and exposed as gensim KeyedVectors, enabling direct similarity queries and vector lookups. Vocabulary sizes are configurable from 1k to 200k subword units, which controls segmentation granularity. The library uses SentencePiece for tokenization and supports both subword segmentation and embedding lookup through a single Python API with automatic model downloading.
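To make the vocabulary-size/granularity tradeoff concrete, here is a minimal, self-contained sketch of the Byte-Pair Encoding idea itself. This is an illustration of the technique, not the bpemb or SentencePiece API (bpemb ships pre-trained models and handles segmentation for you); the corpus, function names, and merge counts below are invented for the example. Fewer merges (a smaller vocabulary) yields finer-grained pieces; more merges yields coarser ones.

```python
# Illustrative BPE sketch (NOT the bpemb API): the number of merges
# stands in for vocabulary size and controls subword granularity.
from collections import Counter


def merge_word(word, pair):
    """Replace every adjacent occurrence of `pair` in `word` with the fused symbol."""
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            out.append(word[i] + word[i + 1])
            i += 2
        else:
            out.append(word[i])
            i += 1
    return tuple(out)


def learn_bpe(corpus, num_merges):
    """Learn `num_merges` BPE merge rules from a list of words."""
    # Start with each word as a tuple of single characters.
    vocab = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the new merge to every word in the working vocabulary.
        vocab = Counter({merge_word(w, best): f for w, f in vocab.items()})
    return merges


def segment(word, merges):
    """Segment a word by replaying the learned merges in order."""
    pieces = tuple(word)
    for pair in merges:
        pieces = merge_word(pieces, pair)
    return list(pieces)


corpus = ["low", "low", "lower", "lowest", "newer", "newest"]
print(segment("lowest", learn_bpe(corpus, 2)))  # ['low', 'e', 's', 't']
print(segment("lowest", learn_bpe(corpus, 8)))  # ['lowe', 'st']
```

The same effect appears in bpemb: a small vocabulary segments rare words into many short pieces, while a large one (toward 200k) keeps common words intact.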
1,221 stars. No commits in the last 6 months.
Stars
1,221
Forks
102
Language
Python
License
MIT
Last pushed
Oct 01, 2024
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/embeddings/bheinzerling/bpemb"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
embeddings-benchmark/mteb
MTEB: Massive Text Embedding Benchmark
yannvgn/laserembeddings
LASER multilingual sentence embeddings as a pip package
harmonydata/harmony
The Harmony Python library: a research tool for psychologists to harmonise data and...
embeddings-benchmark/results
Data for the MTEB leaderboard
MilaNLProc/honest
A Python package to compute HONEST, a score to measure hurtful sentence completions in language...