sentencepiece and sentencepiece-jni
The JNI wrapper is a Java binding that gives JVM code direct access to the core SentencePiece tokenizer library. The two projects are complements designed to be used together, not alternatives.
About sentencepiece
google/sentencepiece
Unsupervised text tokenizer for Neural Network-based text generation.
Implements both the byte-pair encoding (BPE) and unigram language model algorithms, with subword regularization techniques to improve model robustness. Operates directly on raw Unicode text without language-specific preprocessing, and provides end-to-end vocabulary-to-ID mapping with NFKC normalization. Available as self-contained C++ and Python libraries that reach roughly 50k sentences/sec while keeping tokenization consistent across deployments.
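To make the BPE idea concrete, here is a minimal pure-Python sketch of learning and applying merges. It is a toy illustration only, not SentencePiece's implementation or API: the function names (learn_bpe_merges, merge_pair, encode) are hypothetical, the input is a plain list of words, and normalization, whitespace handling, and the unigram model are all left out.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe_merges(words, num_merges):
    """Toy BPE training: repeatedly merge the most frequent adjacent symbol pair.

    Hypothetical helper for illustration; starts from single characters.
    """
    # Each distinct word becomes a tuple of symbols, weighted by its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Deterministic tie-break: highest count first, then smallest pair.
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Segment `word` by applying the learned merges in training order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

Calling learn_bpe_merges on a small word list and passing the result to encode splits frequent character sequences into single pieces while rare ones stay as individual characters, which is the behavior a subword vocabulary relies on.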
About sentencepiece-jni
levyfan/sentencepiece-jni
Java JNI wrapper for SentencePiece: unsupervised text tokenizer for Neural Network-based text generation.