sentencepiece and Tokenizer

SentencePiece is a standalone tokenization library that OpenNMT/Tokenizer wraps, integrating it alongside its own BPE implementation as one of several supported tokenization backends within a broader translation framework.

                 sentencepiece       Tokenizer
Score            84 (Verified)       59 (Established)
Maintenance      13/25               10/25
Adoption         25/25               10/25
Maturity         25/25               16/25
Community        21/25               23/25
Stars            11,697              330
Forks            1,333               80
Downloads        33,078,873
Commits (30d)    2                   0
Language         C++                 C++
License          Apache-2.0          MIT

No risk flags. No package or dependents tracked.

About sentencepiece

google/sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.

Implements both byte-pair-encoding (BPE) and unigram language model algorithms with subword regularization techniques to improve model robustness. Operates directly on raw Unicode text without requiring language-specific preprocessing, and provides end-to-end vocabulary-to-ID mapping with NFKC normalization. Available as self-contained C++ and Python libraries that achieve ~50k sentences/sec throughput while maintaining consistent tokenization across deployments.
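To make the byte-pair-encoding idea concrete, here is a minimal sketch of the BPE merge loop in plain Python. This is an illustration of the algorithm only, not sentencepiece's actual implementation (which is C++, operates on raw Unicode, and also supports the unigram model); the word list and merge count are hypothetical.

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE merge rules from a word-frequency dict.

    words: {word: count}; each word is split into characters plus an
    end-of-word marker so merges cannot cross word boundaries.
    Returns the ordered list of learned merge pairs.
    """
    # Represent each word as a tuple of symbols.
    vocab = {tuple(w) + ("</w>",): c for w, c in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, replacing the best pair with one merged symbol.
        merged = best[0] + best[1]
        new_vocab = {}
        for symbols, count in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = count
        vocab = new_vocab
    return merges

# Hypothetical corpus: frequent character pairs get merged first.
rules = bpe_merges({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 4)
# First merge is ("e", "s"): it is among the most frequent adjacent pairs.
```

Each learned rule becomes a vocabulary entry; applying the rules in order to new text reproduces the same subword segmentation, which is what makes BPE tokenization deterministic across deployments.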

About Tokenizer

OpenNMT/Tokenizer

Fast and customizable text tokenization library with BPE and SentencePiece support

Scores updated daily from GitHub, PyPI, and npm data.