google/sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.

84 / 100

Verified

Implements byte-pair encoding (BPE) and the unigram language model, with subword regularization to improve model robustness. Operates directly on raw Unicode text without language-specific preprocessing, handles the vocabulary-to-ID mapping end to end, and applies NFKC normalization. Available as self-contained C++ and Python libraries that reach roughly 50k sentences/sec and tokenize consistently across deployments.
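To make the description above concrete, here is a minimal standard-library sketch of the ideas it names: NFKC normalization, raw-text handling via a whitespace marker, BPE merge learning, and a piece-to-ID vocabulary. It is a toy illustration, not SentencePiece's implementation or API. The function names (`normalize`, `learn_bpe`, `encode`, `decode`, `build_vocab`) are hypothetical, and unlike the library's default behavior this toy allows merges across word boundaries and does not cover the unigram model or subword regularization.

```python
import unicodedata
from collections import Counter

WS = "\u2581"  # the "▁" meta symbol used here to mark whitespace


def normalize(text):
    """NFKC-normalize, collapse whitespace, and escape spaces with the marker."""
    text = unicodedata.normalize("NFKC", text)
    return WS + WS.join(text.split())


def apply_merge(seq, pair):
    """Replace every adjacent occurrence of `pair` in `seq` with the joined piece."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from raw sentences (toy BPE)."""
    seqs = [list(normalize(s)) for s in corpus]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        # Most frequent pair wins; ties are broken deterministically by the pair itself.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        seqs = [apply_merge(seq, best) for seq in seqs]
    return merges


def encode(text, merges):
    """Split normalized text into characters, then replay the merges in order."""
    seq = list(normalize(text))
    for pair in merges:
        seq = apply_merge(seq, pair)
    return seq


def decode(pieces):
    """Concatenate pieces and turn the whitespace marker back into spaces."""
    return "".join(pieces).replace(WS, " ").strip()


def build_vocab(corpus, merges):
    """Assign an integer ID to every base character and every merged piece."""
    vocab = {}
    for ch in sorted({c for s in corpus for c in normalize(s)}):
        vocab.setdefault(ch, len(vocab))
    for a, b in merges:
        vocab.setdefault(a + b, len(vocab))
    return vocab


corpus = ["low lower lowest", "new newer newest"]
merges = learn_bpe(corpus, num_merges=10)
vocab = build_vocab(corpus, merges)
pieces = encode("lower newest", merges)
ids = [vocab[p] for p in pieces]
```

Because spaces are escaped rather than discarded, `decode(pieces)` returns the normalized input text, which is the property that lets a tokenizer of this kind work on raw text without a separate pre-tokenizer.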

11,697 stars and 33,078,873 monthly downloads. Used by 194 other packages. Actively maintained with 2 commits in the last 30 days. Available on PyPI.

Maintenance 13 / 25

Adoption 25 / 25

Maturity 25 / 25

Community 21 / 25


Stars 11,697

Forks 1,333

Language C++

License Apache-2.0


Related tools

soaxelbrooke/python-bpe

Byte Pair Encoding for Python!

OpenNMT/Tokenizer

Fast and customizable text tokenization library with BPE and SentencePiece support

Systemcluster/kitoken

Fast and versatile tokenizer for language models, compatible with SentencePiece, Tokenizers,...

daac-tools/vibrato

🎤 vibrato: Viterbi-based accelerated tokenizer

taishi-i/toiro

A tool for comparing tokenizers
