sentencepiece and Tokenizer
SentencePiece is a standalone tokenization algorithm and library; OpenNMT/Tokenizer wraps it and exposes it, alongside BPE, as one of several supported subword tokenization backends within a broader translation framework.
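The wrapping relationship can be pictured as one tokenizer front end delegating to interchangeable subword backends. The sketch below is purely conceptual; the class and method names (`SubwordBackend`, `apply`, `CharBackend`) are hypothetical and do not reflect the actual OpenNMT/Tokenizer API.

```python
from typing import Protocol


class SubwordBackend(Protocol):
    """Hypothetical backend interface: turns pre-tokens into subwords."""

    def apply(self, tokens: list[str]) -> list[str]: ...


class CharBackend:
    """Toy stand-in for a trained BPE or SentencePiece model: splits each
    word into characters, marking word starts with the "▁" symbol that
    SentencePiece-style models use."""

    def apply(self, tokens: list[str]) -> list[str]:
        out = []
        for tok in tokens:
            out.extend(["▁" + tok[0]] + list(tok[1:]))
        return out


class Tokenizer:
    """Front end: pre-tokenizes on whitespace, then delegates subword
    segmentation to whichever backend it was constructed with."""

    def __init__(self, backend: SubwordBackend):
        self.backend = backend

    def tokenize(self, text: str) -> list[str]:
        return self.backend.apply(text.split())


print(Tokenizer(CharBackend()).tokenize("new era"))
```

Swapping the backend object is all it takes to move between segmentation schemes, which is the design idea behind supporting both BPE and SentencePiece behind one tokenization interface.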
About sentencepiece
google/sentencepiece
Unsupervised text tokenizer for Neural Network-based text generation.
Implements both the byte-pair-encoding (BPE) and unigram language model algorithms, with subword regularization techniques that improve model robustness. Operates directly on raw Unicode text without language-specific preprocessing, applies NFKC-based normalization, and provides end-to-end text-to-ID conversion. Available as self-contained C++ and Python libraries that reach roughly 50k sentences/sec while producing consistent tokenization across deployments.
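The BPE half of the description works by repeatedly merging the most frequent adjacent symbol pair in a corpus. The following is a minimal pure-Python sketch of that merge loop on a toy corpus; it illustrates the algorithm only and is not SentencePiece's actual implementation.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus of segmented words
    and return the most frequent one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged


# Toy corpus: each word starts fully split into characters, with a count.
corpus = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6}
for _ in range(3):
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)

print(corpus)  # after merging ('w','e'), ('we','r'), ('l','o')
```

Each learned merge becomes a vocabulary entry; applying the same merge list at inference time reproduces the segmentation deterministically, which is why the description can promise consistent tokenization across deployments.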
About Tokenizer
OpenNMT/Tokenizer
Fast and customizable text tokenization library with BPE and SentencePiece support