sefineh-ai/Amharic-Tokenizer

Syllable-aware BPE tokenizer for the Amharic language (አማርኛ) – fast, accurate, trainable.

/ 100

Emerging

Decomposes Amharic fidel characters into atomic components before applying BPE merges, then recomposes during detokenization to guarantee perfect round-trip fidelity. Cython-optimized core accelerates the cleaning→decomposition→tokenization→detokenization pipeline. Provides both a CLI tool and Python API with pretrained vocab (30K tokens, `amh_bpe_v0.2.6`) alongside full trainability on custom corpora.

No Package No Dependents

Maintenance 6 / 25

Adoption 9 / 25

Maturity 13 / 25

Community 15 / 25

How are scores calculated?

Stars

Forks

Language

Python

License

MIT

Higher-rated alternatives

eliben/go-sentencepiece

Go implementation of the SentencePiece tokenizer

mdabir1203/BPE_Tokenizer_Visualizer

A Visualizer to check how BPE Tokenizer in an LLM Works

U4RASD/r-bpe

R-BPE: Improving BPE-Tokenizers with Token Reuse

jmaczan/bpe-tokenizer

Byte-Pair Encoding tokenizer for training large language models on huge datasets

franciszekparma/GBPET

GPT-style language model with Byte Pair Encoding tokenizer, built from scratch in PyTorch.

Explore LLM Tools

All categories Trending LLM Tool directory Insights