Tokenization Algorithms: NLP Tools

Tools and libraries for implementing tokenization algorithms (BPE, WordPiece, SentencePiece, Unigram, byte-level) across various programming languages. Includes tokenizer implementations, benchmarks, and algorithm variants. Does NOT include downstream NLP tasks, language models, or applications that use tokenizers.
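To ground the algorithm names above: the core of BPE is a loop that repeatedly merges the most frequent adjacent symbol pair in a corpus. The sketch below is a toy illustration of that merge loop, not the implementation used by any of the libraries listed here (production tokenizers add byte-level fallback, pre-tokenization, and much faster pair counting):

```python
from collections import Counter

def bpe_merges(word_freqs, num_merges):
    """Learn BPE merge rules from a {word: frequency} dict (toy illustration)."""
    # Represent each word as a tuple of symbols, initially single characters.
    vocab = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, fusing occurrences of the best pair into one symbol.
        merged = best[0] + best[1]
        new_vocab = {}
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

# Classic example corpus: "es" and "est" emerge as subword units.
print(bpe_merges({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 3))
# → [('e', 's'), ('es', 't'), ('l', 'o')]
```

WordPiece and Unigram differ mainly in the merge criterion (likelihood gain rather than raw pair frequency) and in training direction (Unigram prunes a large vocabulary down rather than building one up).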

There are 57 tokenization-algorithm tools tracked. One scores above 70 (Verified tier). The highest-rated is google/sentencepiece at 84/100, with 11,697 stars and 33,078,873 monthly downloads. One of the top 10 is actively maintained.

Get all 57 projects as JSON:

```shell
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=nlp&subcategory=tokenization-algorithms&limit=20"
```

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.

| # | Tool | Description | Score | Tier |
|---|------|-------------|-------|------|
| 1 | google/sentencepiece | Unsupervised text tokenizer for Neural Network-based text generation. | 84 | Verified |
| 2 | soaxelbrooke/python-bpe | Byte Pair Encoding for Python! | 61 | Established |
| 3 | OpenNMT/Tokenizer | Fast and customizable text tokenization library with BPE and SentencePiece support | 59 | Established |
| 4 | Systemcluster/kitoken | Fast and versatile tokenizer for language models, compatible with... | 58 | Established |
| 5 | daac-tools/vibrato | 🎤 vibrato: Viterbi-based accelerated tokenizer | 57 | Established |
| 6 | taishi-i/toiro | A tool for comparing tokenizers | 57 | Established |
| 7 | LanguageMachines/ucto | Unicode tokeniser. Ucto tokenizes text files: it separates words from... | 56 | Established |
| 8 | daac-tools/vaporetto | 🛥 Vaporetto: Very accelerated pointwise prediction based tokenizer | 54 | Established |
| 9 | proycon/python-ucto | This is a Python binding to the tokenizer Ucto. Tokenisation is one of the... | 53 | Established |
| 10 | VKCOM/YouTokenToMe | Unsupervised text tokenizer focused on computational efficiency | 46 | Emerging |
| 11 | JuliaText/WordTokenizers.jl | High performance tokenizers for natural language processing and other related tasks | 45 | Emerging |
| 12 | bnosac/sentencepiece | R package for Byte Pair Encoding / Unigram modelling based on Sentencepiece | 42 | Emerging |
| 13 | ropensci/tokenizers | Fast, Consistent Tokenization of Natural Language Text | 42 | Emerging |
| 14 | levyfan/sentencepiece-jni | Java JNI wrapper for SentencePiece: unsupervised text tokenizer for Neural... | 41 | Emerging |
| 15 | arbox/tokenizer | A simple tokenizer in Ruby for NLP tasks. | 41 | Emerging |
| 16 | dariush-bahrami/character-tokenizer | A character tokenizer for Hugging Face Transformers | 41 | Emerging |
| 17 | jorge-menjivar/tekken-rs | Rust implementation of the Mistral Tekken tokenizer | 40 | Emerging |
| 18 | Moshe-ship/artok | Arabic Token Tax Calculator - see how much more Arabic costs across LLM tokenizers | 39 | Emerging |
| 19 | zencephalon/Tactful_Tokenizer | Accurate Bayesian sentence tokenizer in Ruby. | 33 | Emerging |
| 20 | JuliaStrings/TinySegmenter.jl | Julia version of TinySegmenter, compact Japanese tokenizer | 32 | Emerging |
| 21 | thisiscetin/textoken | Simple and customizable text tokenization gem. | 32 | Emerging |
| 22 | dustalov/greeb | Greeb is a simple Unicode-aware regexp-based tokenizer. | 31 | Emerging |
| 23 | daac-tools/python-vaporetto | 🛥 Vaporetto is a fast and lightweight pointwise prediction based tokenizer.... | 30 | Emerging |
| 24 | ztjhz/word-piece-tokenizer | A Lightweight Word Piece Tokenizer | 30 | Emerging |
| 25 | chengchingwen/BytePairEncoding.jl | Julia implementation of Byte Pair Encoding for NLP | 28 | Experimental |
| 26 | skorani/tokenizer | An open source High level Persian Tokenizer | 27 | Experimental |
| 27 | daac-tools/python-vibrato | Viterbi-based accelerated tokenizer (Python wrapper) | 27 | Experimental |
| 28 | savannstm/language-tokenizer | Text tokenizer for linguistic purposes, such as text matching. Supports more... | 26 | Experimental |
| 29 | 10-OASIS-01/BPEtokenizer | This project implements a tokenizer based on the Byte Pair Encoding (BPE)... | 26 | Experimental |
| 30 | gbenson/dom-tokenizers | DOM-aware tokenization for Hugging Face language models | 25 | Experimental |
| 31 | pranav271103/Ultra-Tokenizer | This project implements a state-of-the-art tokenizer from scratch in Python,... | 25 | Experimental |
| 32 | ImadSaddik/DarijaTokenizers | Free to use tokenizers trained on the Darija language. | 24 | Experimental |
| 33 | North-Shore-AI/tiktoken_ex | Pure Elixir TikToken-style byte-level BPE tokenizer (Kimi K2 compatible). | 23 | Experimental |
| 34 | AddyDelaCruz/swift-tiktoken | 🎉 Implement a lightweight, pure Swift tokenizer for OpenAI's tiktoken,... | 23 | Experimental |
| 35 | chaablo69/rustbpe | 🔧 Train efficient BPE tokenizers in Rust with simple Python bindings,... | 22 | Experimental |
| 36 | scientist-labs/tokenkit | Fast, Rust-backed word-level tokenization for Ruby. Unlike subword... | 22 | Experimental |
| 37 | tommasofacchin/ft-tokenize | Small C++ tokenizer with support for word-level and BPE tokenization,... | 16 | Experimental |
| 38 | designer-coderajay/bpe-tokenizer-scratch | Byte-Pair Encoding tokenizer built from scratch in Python. The same... | 16 | Experimental |
| 39 | UtkarshTheDev/tokenizer | Interactive BPE (Byte-Pair Encoding) tokenizer and CLI utility for... | 15 | Experimental |
| 40 | shivendrra/shredword-trainer | BPE & Unigram Vocab Training library | 15 | Experimental |
| 41 | hppRC/saku | A Japanese Sentence Tokenizer written in Rust. | 14 | Experimental |
| 42 | riyad-derguini/End-to-End-NLP-Systems | Modular toolkit for End-to-End NLP: Implementing advanced subword... | 14 | Experimental |
| 43 | yenniejun/tokenizers-languages | Comparing LLM tokenizers in multiple languages | 13 | Experimental |
| 44 | michaelnmmeyer/mascara | A natural language tokenizer | 13 | Experimental |
| 45 | dongjinleekr/beanpiece | A Java binding to Google SentencePiece | 13 | Experimental |
| 46 | CarolinElsner/Speech-Tokenization | The tokenisation of spoken text. Received by the Watson STT and sent to the... | 12 | Experimental |
| 47 | kiarashrahmani/English-Persian-Tokenizer | This project is a simple tokenizer for text processing that can tokenize... | 12 | Experimental |
| 48 | SeanLee97/BertWordPieceTokenizer.jl | WordPiece Tokenizer for BERT models. | 12 | Experimental |
| 49 | edoardosignoroni/hftoks-eval | High Frequency Tokenizer - Evaluation | 11 | Experimental |
| 50 | jonasliendl/bpe_tokenizer | ✨ BPE-Tokenizer for university module Foundational Generative Models. | 11 | Experimental |
| 51 | nicogabriel1708/rust-tokenizer | An efficient text tokenization library featuring various models, written in Rust. | 11 | Experimental |
| 52 | teleprint-me/byte-pair | Byte Pair Encoder (BPE) for Natural Language Processing. | 11 | Experimental |
| 53 | victor-iyi/wikitext | Train and perform NLP tasks on the wikitext-103 dataset in Rust | 10 | Experimental |
| 54 | hscspring/bytepiece-rs | The Bytepiece Tokenizer Implemented in Rust. | 10 | Experimental |
| 55 | delph-in/repp | Regular Expression Preprocessor | 10 | Experimental |
| 56 | jonasknobloch/tokenizers-mbpe | Morphologically biased byte-pair encoding pre-tokenization | 10 | Experimental |
| 57 | Textualization/RophertaTokenizer | BPE Tokenizer for Ropherta (subclass of GPT3Tokenizer) | 10 | Experimental |