Tokenizer and YouTokenToMe

These are **competitors**: both are standalone, general-purpose subword tokenization libraries for text preprocessing (Tokenizer supports BPE and SentencePiece models; YouTokenToMe implements BPE), and there are no integration points between them.

| | Tokenizer (Established) | YouTokenToMe (Emerging) |
|---|---|---|
| Maintenance | 10/25 | 0/25 |
| Adoption | 10/25 | 10/25 |
| Maturity | 16/25 | 16/25 |
| Community | 23/25 | 20/25 |
| Stars | 330 | 975 |
| Forks | 80 | 109 |
| Downloads | — | — |
| Commits (30d) | 0 | 0 |
| Language | C++ | C++ |
| License | MIT | MIT |
| Tags | No Package, No Dependents | Archived, Stale 6m, No Package, No Dependents |

About Tokenizer

OpenNMT/Tokenizer

Fast and customizable text tokenization library with BPE and SentencePiece support

About YouTokenToMe

VKCOM/YouTokenToMe

Unsupervised text tokenizer focused on computational efficiency

Implements Byte Pair Encoding with O(N) complexity, where N is the length of the training data, using a multithreaded C++ backend. Space is treated as a word boundary, so tokens never cross words; spaces are replaced by the "▁" meta-symbol. Provides Python bindings and CLI tools, and supports BPE-dropout regularization and reversible encoding/decoding. The project reports training and inference up to 60× faster than Hugging Face tokenizers, fastBPE, and SentencePiece, achieved through efficient parallel processing.
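The space-as-boundary behaviour can be illustrated with a toy BPE in pure Python. This is a minimal sketch under stated assumptions, not YouTokenToMe's implementation or API: the function names (`train_bpe`, `encode`, `decode`) are hypothetical, the trainer is a naive loop rather than an O(N) multithreaded algorithm, BPE-dropout is omitted, and the tie-breaking rule is an arbitrary choice made only to keep the result deterministic.

```python
from collections import Counter

META = "\u2581"  # the "▁" meta-symbol, placed at the start of every word


def to_words(text):
    """Split on whitespace and prefix each word with the meta-symbol."""
    return [META + word for word in text.split()]


def merge_pair(symbols, pair):
    """Join every adjacent occurrence of `pair` in `symbols` into one symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges; pairs are only counted inside a word."""
    vocab = Counter(tuple(word) for word in to_words(text))
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair; ties broken by the pair itself (assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(merge_pair(list(symbols), best))] += freq
        vocab = new_vocab
    return merges


def encode(text, merges):
    """Tokenize `text` by replaying the learned merges word by word."""
    tokens = []
    for word in to_words(text):
        symbols = list(word)
        for pair in merges:
            symbols = merge_pair(symbols, pair)
        tokens.extend(symbols)
    return tokens


def decode(tokens):
    """Invert `encode` for text whose words are separated by single spaces."""
    return "".join(tokens).replace(META, " ").strip()


if __name__ == "__main__":
    merges = train_bpe("low lower lowest low low", num_merges=10)
    tokens = encode("low lower", merges)
    print(tokens)
    print(decode(tokens))
```

Because pairs are counted per word, no learned token can span a space, and the "▁" prefix is what lets `decode` restore the spaces. In this sketch the round trip is exact only for input with single spaces between words and no literal "▁" characters.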

Related comparisons

Tokenizer and sentencepiece, Tokenizer and kitoken

Scores updated daily from GitHub, PyPI, and npm data.