sefineh-ai/Amharic-Tokenizer

Syllable-aware BPE tokenizer for the Amharic language (አማርኛ) – fast, accurate, trainable.

43
/ 100
Emerging

Decomposes Amharic fidel characters into atomic components before applying BPE merges, then recomposes during detokenization to guarantee perfect round-trip fidelity. Cython-optimized core accelerates the cleaning→decomposition→tokenization→detokenization pipeline. Provides both a CLI tool and Python API with pretrained vocab (30K tokens, `amh_bpe_v0.2.6`) alongside full trainability on custom corpora.

No Package No Dependents
Maintenance 6 / 25
Adoption 9 / 25
Maturity 13 / 25
Community 15 / 25

How are scores calculated?

Stars

98

Forks

14

Language

Python

License

MIT

Category

bpe-tokenizers

Last pushed

Nov 17, 2025

Commits (30d)

0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/sefineh-ai/Amharic-Tokenizer"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.