Tokenization Algorithms NLP Tools
Tools and libraries for implementing tokenization algorithms (BPE, WordPiece, SentencePiece, Unigram, byte-level) across various programming languages. Includes tokenizer implementations, benchmarks, and algorithm variants. Does NOT include downstream NLP tasks, language models, or applications that use tokenizers.
There are 57 tokenization algorithm tools tracked. One scores above 70 (Verified tier). The highest-rated is google/sentencepiece at 84/100, with 11,697 stars and 33,078,873 monthly downloads. One of the top 10 is actively maintained.
Get all 57 projects as JSON
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=nlp&subcategory=tokenization-algorithms&limit=20"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
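For scripted access, the same request can be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_url` and `fetch_projects` are hypothetical, the query parameters are copied from the curl command, and the shape of the JSON response is not documented here, so the code decodes the body without assuming any particular fields.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken verbatim from the curl command in this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="nlp", subcategory="tokenization-algorithms", limit=20):
    """Build the query URL (hypothetical helper; parameter names come from the curl example)."""
    query = urllib.parse.urlencode(
        {"domain": domain, "subcategory": subcategory, "limit": limit}
    )
    return f"{BASE_URL}?{query}"


def fetch_projects(url, timeout=30):
    """Send a GET request and decode the JSON body.

    The response schema is an unknown here, so the decoded object
    is returned as-is for the caller to inspect.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_projects(build_url())` performs the request and counts against the daily quota. The example keeps `limit=20` as given in the curl command; whether a larger value returns all 57 projects in one call is not confirmed by this page.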
| # | Tool | Description | Score | Tier |
|---|---|---|---|---|
| 1 | google/sentencepiece | Unsupervised text tokenizer for Neural Network-based text generation. | | Verified |
| 2 | soaxelbrooke/python-bpe | Byte Pair Encoding for Python! | | Established |
| 3 | OpenNMT/Tokenizer | Fast and customizable text tokenization library with BPE and SentencePiece support | | Established |
| 4 | Systemcluster/kitoken | Fast and versatile tokenizer for language models, compatible with... | | Established |
| 5 | daac-tools/vibrato | 🎤 vibrato: Viterbi-based accelerated tokenizer | | Established |
| 6 | taishi-i/toiro | A tool for comparing tokenizers | | Established |
| 7 | LanguageMachines/ucto | Unicode tokeniser. Ucto tokenizes text files: it separates words from... | | Established |
| 8 | daac-tools/vaporetto | 🛥 Vaporetto: Very accelerated pointwise prediction based tokenizer | | Established |
| 9 | proycon/python-ucto | This is a Python binding to the tokenizer Ucto. Tokenisation is one of the... | | Established |
| 10 | VKCOM/YouTokenToMe | Unsupervised text tokenizer focused on computational efficiency | | Emerging |
| 11 | JuliaText/WordTokenizers.jl | High performance tokenizers for natural language processing and other related tasks | | Emerging |
| 12 | bnosac/sentencepiece | R package for Byte Pair Encoding / Unigram modelling based on Sentencepiece | | Emerging |
| 13 | ropensci/tokenizers | Fast, Consistent Tokenization of Natural Language Text | | Emerging |
| 14 | levyfan/sentencepiece-jni | Java JNI wrapper for SentencePiece: unsupervised text tokenizer for Neural... | | Emerging |
| 15 | arbox/tokenizer | A simple tokenizer in Ruby for NLP tasks. | | Emerging |
| 16 | dariush-bahrami/character-tokenizer | A character tokenizer for Hugging Face Transformers | | Emerging |
| 17 | jorge-menjivar/tekken-rs | Rust implementation of the Mistral Tekken tokenizer | | Emerging |
| 18 | Moshe-ship/artok | Arabic Token Tax Calculator - see how much more Arabic costs across LLM tokenizers | | Emerging |
| 19 | zencephalon/Tactful_Tokenizer | Accurate Bayesian sentence tokenizer in Ruby. | | Emerging |
| 20 | JuliaStrings/TinySegmenter.jl | Julia version of TinySegmenter, compact Japanese tokenizer | | Emerging |
| 21 | thisiscetin/textoken | Simple and customizable text tokenization gem. | | Emerging |
| 22 | dustalov/greeb | Greeb is a simple Unicode-aware regexp-based tokenizer. | | Emerging |
| 23 | daac-tools/python-vaporetto | 🛥 Vaporetto is a fast and lightweight pointwise prediction based tokenizer.... | | Emerging |
| 24 | ztjhz/word-piece-tokenizer | A Lightweight Word Piece Tokenizer | | Emerging |
| 25 | chengchingwen/BytePairEncoding.jl | Julia implementation of Byte Pair Encoding for NLP | | Experimental |
| 26 | skorani/tokenizer | An open source High level Persian Tokenizer | | Experimental |
| 27 | daac-tools/python-vibrato | Viterbi-based accelerated tokenizer (Python wrapper) | | Experimental |
| 28 | savannstm/language-tokenizer | Text tokenizer for linguistic purposes, such as text matching. Supports more... | | Experimental |
| 29 | 10-OASIS-01/BPEtokenizer | This project implements a tokenizer based on the Byte Pair Encoding (BPE)... | | Experimental |
| 30 | gbenson/dom-tokenizers | DOM-aware tokenization for Hugging Face language models | | Experimental |
| 31 | pranav271103/Ultra-Tokenizer | This project implements a state-of-the-art tokenizer from scratch in Python,... | | Experimental |
| 32 | ImadSaddik/DarijaTokenizers | Free to use tokenizers trained on the Darija language. | | Experimental |
| 33 | North-Shore-AI/tiktoken_ex | Pure Elixir TikToken-style byte-level BPE tokenizer (Kimi K2 compatible). | | Experimental |
| 34 | AddyDelaCruz/swift-tiktoken | 🎉 Implement a lightweight, pure Swift tokenizer for OpenAI's tiktoken,... | | Experimental |
| 35 | chaablo69/rustbpe | 🔧 Train efficient BPE tokenizers in Rust with simple Python bindings,... | | Experimental |
| 36 | scientist-labs/tokenkit | Fast, Rust-backed word-level tokenization for Ruby. Unlike subword... | | Experimental |
| 37 | tommasofacchin/ft-tokenize | Small C++ tokenizer with support for word-level and BPE tokenization,... | | Experimental |
| 38 | designer-coderajay/bpe-tokenizer-scratch | Byte-Pair Encoding tokenizer built from scratch in Python. The same... | | Experimental |
| 39 | UtkarshTheDev/tokenizer | Interactive BPE (Byte-Pair Encoding) tokenizer and CLI utility for... | | Experimental |
| 40 | shivendrra/shredword-trainer | BPE & Unigram Vocab Training library | | Experimental |
| 41 | hppRC/saku | A Japanese Sentence Tokenizer written in Rust. | | Experimental |
| 42 | riyad-derguini/End-to-End-NLP-Systems | Modular toolkit for End-to-End NLP: Implementing advanced subword... | | Experimental |
| 43 | yenniejun/tokenizers-languages | Comparing LLM tokenizers in multiple languages | | Experimental |
| 44 | michaelnmmeyer/mascara | A natural language tokenizer | | Experimental |
| 45 | dongjinleekr/beanpiece | A Java binding to Google SentencePiece | | Experimental |
| 46 | CarolinElsner/Speech-Tokenization | The tokenisation of spoken text. Received by the Watson STT and sent to the... | | Experimental |
| 47 | kiarashrahmani/English-Persian-Tokenizer | This project is a simple tokenizer for text processing that can tokenize... | | Experimental |
| 48 | SeanLee97/BertWordPieceTokenizer.jl | WordPiece Tokenizer for BERT models. | | Experimental |
| 49 | edoardosignoroni/hftoks-eval | High Frequency Tokenizer - Evaluation | | Experimental |
| 50 | jonasliendl/bpe_tokenizer | ✨ BPE-Tokenizer for university module Foundational Generative Models. | | Experimental |
| 51 | nicogabriel1708/rust-tokenizer | An efficient text tokenization library featuring various models, written in Rust. | | Experimental |
| 52 | teleprint-me/byte-pair | Byte Pair Encoder (BPE) for Natural Language Processing. | | Experimental |
| 53 | victor-iyi/wikitext | Train and perform NLP tasks on the wikitext-103 dataset in Rust | | Experimental |
| 54 | hscspring/bytepiece-rs | The Bytepiece Tokenizer Implemented in Rust. | | Experimental |
| 55 | delph-in/repp | Regular Expression Preprocessor | | Experimental |
| 56 | jonasknobloch/tokenizers-mbpe | Morphologically biased byte-pair encoding pre-tokenization | | Experimental |
| 57 | Textualization/RophertaTokenizer | BPE Tokenizer for Ropherta (subclass of GPT3Tokenizer) | | Experimental |