sentencepiece and YouTokenToMe

These are competing implementations of unsupervised subword tokenization: SentencePiece implements both BPE and unigram language modeling, while YouTokenToMe implements BPE only. SentencePiece dominates adoption in production NLP pipelines; YouTokenToMe targets use cases that prioritize inference speed over ecosystem integration.

sentencepiece — score 84 (Verified)
  Maintenance: 13/25   Adoption: 25/25   Maturity: 25/25   Community: 21/25
  Stars: 11,697   Forks: 1,333   Downloads: 33,078,873   Commits (30d): 2
  Language: C++   License: Apache-2.0
  Risk flags: none

YouTokenToMe — score 46 (Emerging)
  Maintenance: 0/25   Adoption: 10/25   Maturity: 16/25   Community: 20/25
  Stars: 975   Forks: 109   Downloads: n/a   Commits (30d): 0
  Language: C++   License: MIT
  Risk flags: Archived, Stale 6m, No Package, No Dependents

About sentencepiece

google/sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.

Implements both byte-pair-encoding (BPE) and unigram language model algorithms with subword regularization techniques to improve model robustness. Operates directly on raw Unicode text without requiring language-specific preprocessing, and provides end-to-end vocabulary-to-ID mapping with NFKC normalization. Available as self-contained C++ and Python libraries that achieve ~50k sentences/sec throughput while maintaining consistent tokenization across deployments.
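The unigram language model approach scores every possible segmentation of a piece of text against a vocabulary of subwords with learned log-probabilities and keeps the most probable one. A minimal pure-Python sketch of that Viterbi-style search, assuming a toy hand-built vocabulary (the scores and the helper name `viterbi_segment` are invented for illustration, not SentencePiece's actual model format or API):

```python
import math

def viterbi_segment(text, vocab):
    """Find the highest-scoring segmentation of `text` into vocab pieces.

    vocab maps each subword piece to its log-probability; the score of a
    segmentation is the sum of its pieces' log-probabilities.
    """
    n = len(text)
    # best[i] = (score, tokenization) for the prefix text[:i]
    best = [(-math.inf, [])] * (n + 1)
    best[0] = (0.0, [])
    for i in range(1, n + 1):
        # Try every piece text[j:i] ending at position i (length capped at 8).
        for j in range(max(0, i - 8), i):
            piece = text[j:i]
            if piece in vocab and best[j][0] + vocab[piece] > best[i][0]:
                best[i] = (best[j][0] + vocab[piece], best[j][1] + [piece])
    return best[n][1]

# Toy vocabulary: log-probabilities invented for illustration.
vocab = {"h": -5.0, "e": -5.0, "l": -5.0, "o": -5.0,
         "he": -2.5, "ll": -2.5, "hell": -3.0, "lo": -2.5}
print(viterbi_segment("hello", vocab))  # ['hell', 'o']
```

Subword regularization builds on the same search: instead of always emitting the single best segmentation, training-time tokenization samples from the n-best candidates, exposing the model to alternative segmentations of the same text.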

About YouTokenToMe

VKCOM/YouTokenToMe

Unsupervised text tokenizer focused on computational efficiency

Implements Byte Pair Encoding with O(N) training complexity via a multithreaded C++ backend, using space-as-boundary tokenization that preserves word boundaries through the "▁" meta-symbol. Provides Python bindings and CLI tools supporting BPE-dropout regularization and reversible encoding/decoding. The project's own benchmarks claim speedups of up to 60× over Hugging Face tokenizers, fastBPE, and SentencePiece on training and inference, attributed to efficient parallel processing.
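The BPE training loop itself is simple to state: repeatedly count adjacent symbol pairs across the corpus and merge the most frequent pair into a new symbol; the engineering contribution in YouTokenToMe is doing this in O(N) with multithreading. A naive pure-Python sketch of the merge loop, assuming a tiny toy corpus (the helper name `learn_bpe` and the word frequencies are invented for illustration, not YouTokenToMe's API):

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merge rules from a frequency table.

    words maps each word, given as a tuple of symbols (with a leading "▁"
    word-boundary meta-symbol), to its corpus frequency.
    Returns the ordered list of learned merge rules.
    """
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge everywhere it occurs.
        merged = {}
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        words = merged
    return merges

# Toy corpus: "low" appears 5 times, "lower" twice.
corpus = {("▁", "l", "o", "w"): 5, ("▁", "l", "o", "w", "e", "r"): 2}
print(learn_bpe(corpus, 3))  # [('▁', 'l'), ('▁l', 'o'), ('▁lo', 'w')]
```

This quadratic-per-merge loop is what the O(N) implementations avoid: they keep pair counts in a priority structure and update only the counts affected by each merge rather than rescanning the corpus.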

Scores updated daily from GitHub, PyPI, and npm data.