sentencepiece and Tokenizer

SentencePiece is a standalone tokenization library that OpenNMT/Tokenizer wraps, integrating it alongside its own BPE implementation as one of several supported tokenization backends within a broader translation framework.

                 sentencepiece       Tokenizer
Score            84 (Verified)       59 (Established)
Maintenance      13/25               10/25
Adoption         25/25               10/25
Maturity         25/25               16/25
Community        21/25               23/25
Stars            11,697              330
Forks            1,333               80
Downloads        33,078,873
Commits (30d)    2                   0
Language         C++                 C++
License          Apache-2.0          MIT

No risk flags. No package or dependents tracked.

About sentencepiece

google/sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.

Implements both byte-pair-encoding (BPE) and unigram language model algorithms with subword regularization techniques to improve model robustness. Operates directly on raw Unicode text without requiring language-specific preprocessing, and provides end-to-end vocabulary-to-ID mapping with NFKC normalization. Available as self-contained C++ and Python libraries that achieve ~50k sentences/sec throughput while maintaining consistent tokenization across deployments.
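To make the byte-pair-encoding idea concrete, here is a minimal sketch of the BPE merge loop in plain Python. This is an illustration of the algorithm only, not sentencepiece's actual implementation (which is C++, operates on raw Unicode, and also supports the unigram model); the word list and merge count are hypothetical.

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE merge rules from a word-frequency dict.

    words: {word: count}; each word is split into characters plus an
    end-of-word marker so merges cannot cross word boundaries.
    Returns the ordered list of learned merge pairs.
    """
    # Represent each word as a tuple of symbols.
    vocab = {tuple(w) + ("</w>",): c for w, c in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, replacing the best pair with one merged symbol.
        merged = best[0] + best[1]
        new_vocab = {}
        for symbols, count in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = count
        vocab = new_vocab
    return merges

# Hypothetical corpus: frequent character pairs get merged first.
rules = bpe_merges({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 4)
# First merge is ("e", "s"): it is among the most frequent adjacent pairs.
```

Each learned rule becomes a vocabulary entry; applying the rules in order to new text reproduces the same subword segmentation, which is what makes BPE tokenization deterministic across deployments.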

About Tokenizer

OpenNMT/Tokenizer

Fast and customizable text tokenization library with BPE and SentencePiece support

Scores updated daily from GitHub, PyPI, and npm data.