Tokenizer Libraries Transformer Models

Libraries and implementations for tokenization across programming languages and frameworks. Includes tokenizer training, conversion, alignment, and optimization. Does NOT include higher-level NLP tasks, token classification, or downstream language model applications.

20 tokenizer libraries are tracked. One scores above 70 (the Verified tier). The highest-rated is huggingface/tokenizers at 90/100, with 10,520 stars and 129,702,376 monthly downloads. One of the top 10 is actively maintained.

Get all 20 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=transformers&subcategory=tokenizer-libraries&limit=20"

The endpoint is open to everyone: 100 requests/day with no key needed, or 1,000 requests/day with a free key.
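The same request can be made from a script. The following is a minimal Python sketch that uses only the standard library and assumes nothing beyond the URL and query parameters shown in the curl command above. The helper names `build_url` and `fetch_projects` are hypothetical, and the shape of the JSON response is not documented here, so the code returns the decoded body as-is rather than assuming any field names.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl command above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="transformers", subcategory="tokenizer-libraries", limit=20):
    """Build the request URL with the same query parameters as the curl example."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_projects(url, timeout=10):
    """Fetch the URL and decode the body as JSON.

    The response schema is an unknown here, so the decoded object is
    returned unchanged for the caller to inspect.
    """
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_projects(build_url())` performs the keyless request, which counts against the 100 requests/day limit. Inspecting the returned object first is the safest way to learn its structure before writing code that depends on specific fields.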

| # | Model | Description | Score | Tier |
|---|-------|-------------|-------|------|
| 1 | huggingface/tokenizers | 💥 Fast State-of-the-Art Tokenizers optimized for Research and Production | 90 | Verified |
| 2 | megagonlabs/ginza-transformers | Use custom tokenizers in spacy-transformers | 48 | Emerging |
| 3 | Kaleidophon/token2index | A lightweight but powerful library to build token indices for NLP tasks,... | 48 | Emerging |
| 4 | NVIDIA/Cosmos-Tokenizer | A suite of image and video neural tokenizers | 42 | Emerging |
| 5 | Hugging-Face-Supporter/tftokenizers | Use Huggingface Transformer and Tokenizers as Tensorflow Reusable SavedModels | 38 | Emerging |
| 6 | wangcongcong123/ttt | A package for fine-tuning Transformers with TPUs, written in Tensorflow2.0+ | 35 | Emerging |
| 7 | nlpodyssey/gotokenizers | Go implementation of today's most used tokenizers | 35 | Emerging |
| 8 | Beomi/megatronlm_dataset_autotokenizer | Megatron-LM/GPT-NeoX compatible Text Encoder with 🤗Transformers AutoTokenizer. | 23 | Experimental |
| 9 | Mbeeee111/tokenizer.cpp | 📦 Optimize tokenization in C++ for HuggingFace models with a fast,... | 23 | Experimental |
| 10 | mazebrr/language-tokenizer | 🧩 Tokenize text efficiently across multiple languages using our robust... | 22 | Experimental |
| 11 | technion-cs-nlp/BiologicalTokenizers | Effect of tokenization on transformers for biological sequence | 21 | Experimental |
| 12 | dnbaker/bioseq | Tokenizers and Machine Learning Models for biological sequence data | 21 | Experimental |
| 13 | muna-ai/libtokenizers | C/C++ bindings from Huggingface Tokenizers. | 19 | Experimental |
| 14 | JaydenTeoh/beyond-next-token-prediction | Curated collection of research on the limitations of next-token prediction... | 18 | Experimental |
| 15 | symanto-research/merge-tokenizers | Package to align tokens from different tokenizations. | 15 | Experimental |
| 16 | hikmatazimzade/azerbaijani-tokenizer | High-Performance Azerbaijani Tokenizers (30% fewer tokens, 40% faster than... | 14 | Experimental |
| 17 | Mecanik/Tiny-BPE-Trainer | Lightweight, header-only Byte Pair Encoding (BPE) trainer in modern C++17.... | 14 | Experimental |
| 18 | Systemcluster/tokenizer | General tokenizer library for the Web and Node. Supports Huggingface and... | 11 | Experimental |
| 19 | LauryneL/Pipographe-v2 | Pipographe v2.0.0 is a web application built on a database... | 11 | Experimental |
| 20 | NotShrirang/marathi-tokenizer | 🖋️ A sleek, BPE-powered tokenizer that understands the richness of Marathi. | 11 | Experimental |