Tokenizer Libraries for Transformer Models
Libraries and implementations for tokenization across programming languages and frameworks. Includes tokenizer training, conversion, alignment, and optimization. Does NOT include higher-level NLP tasks, token classification, or downstream language model applications.
20 tokenizer libraries are tracked. One scores above 70 (Verified tier). The highest-rated is huggingface/tokenizers at 90/100, with 10,520 stars and 129,702,376 monthly downloads. One of the top 10 is actively maintained.
Get all 20 projects as JSON
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=transformers&subcategory=tokenizer-libraries&limit=20"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
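For scripted access, the same request can be made from Python. The following is a minimal sketch using only the standard library; the helper names `build_url` and `fetch_projects` are hypothetical, and because the response schema is not documented here, the sketch only assumes the body is valid JSON and returns it as parsed.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl command above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="transformers", subcategory="tokenizer-libraries", limit=20):
    """Build the request URL with the same query parameters as the curl call."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_projects(url, timeout=30):
    """Fetch the URL and parse the body as JSON.

    Assumption: the endpoint returns a JSON document. Its exact structure
    is not specified in this page, so the parsed value is returned unchanged.
    """
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example usage (performs a network request, so it is left commented out):
# data = fetch_projects(build_url())
# print(json.dumps(data, indent=2)[:500])
```

Keeping URL construction separate from the network call makes it easy to change `limit` or `subcategory` without touching the fetch logic, and keyless use stays within the 100 requests/day allowance only if calls are made sparingly.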
| # | Model | Score | Tier |
|---|---|---|---|
| 1 | huggingface/tokenizers<br>💥 Fast State-of-the-Art Tokenizers optimized for Research and Production | 90 | Verified |
| 2 | megagonlabs/ginza-transformers<br>Use custom tokenizers in spacy-transformers | | Emerging |
| 3 | Kaleidophon/token2index<br>A lightweight but powerful library to build token indices for NLP tasks,... | | Emerging |
| 4 | NVIDIA/Cosmos-Tokenizer<br>A suite of image and video neural tokenizers | | Emerging |
| 5 | Hugging-Face-Supporter/tftokenizers<br>Use Huggingface Transformer and Tokenizers as Tensorflow Reusable SavedModels | | Emerging |
| 6 | wangcongcong123/ttt<br>A package for fine-tuning Transformers with TPUs, written in Tensorflow2.0+ | | Emerging |
| 7 | nlpodyssey/gotokenizers<br>Go implementation of today's most used tokenizers | | Emerging |
| 8 | Beomi/megatronlm_dataset_autotokenizer<br>Megatron-LM/GPT-NeoX compatible Text Encoder with 🤗Transformers AutoTokenizer. | | Experimental |
| 9 | Mbeeee111/tokenizer.cpp<br>📦 Optimize tokenization in C++ for HuggingFace models with a fast,... | | Experimental |
| 10 | mazebrr/language-tokenizer<br>🧩 Tokenize text efficiently across multiple languages using our robust... | | Experimental |
| 11 | technion-cs-nlp/BiologicalTokenizers<br>Effect of tokenization on transformers for biological sequence | | Experimental |
| 12 | dnbaker/bioseq<br>Tokenizers and Machine Learning Models for biological sequence data | | Experimental |
| 13 | muna-ai/libtokenizers<br>C/C++ bindings from Huggingface Tokenizers. | | Experimental |
| 14 | JaydenTeoh/beyond-next-token-prediction<br>Curated collection of research on the limitations of next-token prediction... | | Experimental |
| 15 | symanto-research/merge-tokenizers<br>Package to align tokens from different tokenizations. | | Experimental |
| 16 | hikmatazimzade/azerbaijani-tokenizer<br>High-Performance Azerbaijani Tokenizers (30% fewer tokens, 40% faster than... | | Experimental |
| 17 | Mecanik/Tiny-BPE-Trainer<br>Lightweight, header-only Byte Pair Encoding (BPE) trainer in modern C++17.... | | Experimental |
| 18 | Systemcluster/tokenizer<br>General tokenizer library for the Web and Node. Supports Huggingface and... | | Experimental |
| 19 | LauryneL/Pipographe-v2<br>Pipographe v2.0.0 is a web application built on a database... | | Experimental |
| 20 | NotShrirang/marathi-tokenizer<br>🖋️ A sleek, BPE-powered tokenizer that understands the richness of Marathi. | | Experimental |