Tokenizer and YouTokenToMe
These are **competitors**: both are standalone, general-purpose tokenization libraries for the same use case of preprocessing text. Tokenizer offers customizable tokenization with BPE and SentencePiece support, while YouTokenToMe focuses on a fast, unsupervised BPE implementation. There are no integration points between them.
About Tokenizer
OpenNMT/Tokenizer
Fast and customizable text tokenization library with BPE and SentencePiece support
About YouTokenToMe
VKCOM/YouTokenToMe
Unsupervised text tokenizer focused on computational efficiency
Implements Byte Pair Encoding with O(N) complexity, where N is the length of the training data, using a multithreaded C++ backend. Tokens never cross word boundaries: spaces act as boundaries, and each word start is marked with the "▁" meta-symbol. Provides Python bindings and CLI tools supporting BPE-dropout regularization and reversible encoding and decoding. The project reports being up to 60× faster than Hugging Face tokenizers, fastBPE, and SentencePiece in training and inference, which it attributes to efficient parallel processing.
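To make the ideas above concrete, here is a toy pure-Python sketch of BPE with the "▁" word-boundary marker, a simplified form of BPE-dropout, and reversible decoding. It is not YouTokenToMe's implementation or API: the function names (`train_bpe`, `encode`, `decode`, `apply_merge`) are hypothetical, the training loop is a naive quadratic one rather than the O(N) multithreaded algorithm, and the dropout here skips a merge rule for a whole word, which is an assumption made for brevity.

```python
import random
from collections import Counter

META = "\u2581"  # the "▁" meta-symbol marking the start of each word


def to_symbols(text):
    """Split on whitespace; each word becomes a list of characters prefixed with META."""
    return [list(META + word) for word in text.split()]


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with the concatenated symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` rules by repeatedly merging the most frequent pair.

    Pairs are counted inside words only, so no token spans a word boundary.
    """
    words = Counter(tuple(w) for w in to_symbols(corpus))
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Ties are broken by the pair itself so training is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged = Counter()
        for word, freq in words.items():
            merged[tuple(apply_merge(list(word), best))] += freq
        words = merged
    return merges


def encode(text, merges, dropout_prob=0.0, rng=random):
    """Apply merge rules in learned order; each rule is skipped with `dropout_prob`."""
    tokens = []
    for symbols in to_symbols(text):
        for pair in merges:
            if dropout_prob and rng.random() < dropout_prob:
                continue  # simplified BPE-dropout: drop this rule for this word
            symbols = apply_merge(symbols, pair)
        tokens.extend(symbols)
    return tokens


def decode(tokens):
    """Invert `encode` for text whose words are separated by single spaces."""
    return "".join(tokens).replace(META, " ").strip()


if __name__ == "__main__":
    rules = train_bpe("low low low lower lowest", num_merges=3)
    pieces = encode("low lower lowest", rules)
    print(pieces)
    print(decode(pieces))
```

Because every word carries its own "▁" prefix, joining the tokens and turning the marker back into a space recovers the input, which is the reversibility property described above. Setting `dropout_prob` above zero yields different segmentations of the same text across calls, which is the regularization effect BPE-dropout relies on.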