huggingface/tokenizers
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
Implemented in Rust with Python, Node.js, and Ruby bindings, the library supports the BPE, WordPiece, and Unigram tokenization algorithms, with integrated normalization that tracks character-level alignment back to the original text. It handles the full preprocessing pipeline, including truncation, padding, and special-token injection, and supports both vocabulary training and inference through a unified, modular API.
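A minimal sketch of the Python binding, showing training a small BPE vocabulary and the character-offset alignment described above (the tiny two-sentence corpus is illustrative only):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Build a BPE tokenizer with whitespace pre-tokenization.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Train a vocabulary directly from an in-memory iterator (toy corpus).
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]"])
tokenizer.train_from_iterator(["hello world", "hello tokenizers"], trainer=trainer)

# Encode; `offsets` gives each token's character span in the original text.
enc = tokenizer.encode("hello world")
print(enc.tokens)
print(enc.offsets)
```

The same `Tokenizer` object handles both training and inference, which is the unified API the summary refers to.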
10,520 stars and 129,702,376 monthly downloads. Used by 122 other packages. Actively maintained with 33 commits in the last 30 days. Available on PyPI and npm.
Stars
10,520
Forks
1,051
Language
Rust
License
Apache-2.0
Category
Last pushed
Feb 28, 2026
Monthly downloads
129,702,376
Commits (30d)
33
Dependencies
14
Reverse dependents
122
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/huggingface/tokenizers"
Open to everyone: 100 requests/day with no key needed. A free API key raises the limit to 1,000 requests/day.
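The curl call above can also be made programmatically; a stdlib-only Python sketch, assuming the endpoint returns a JSON object (the response fields are not documented here, so none are named):

```python
import json
import urllib.request

# Same endpoint as the curl example; no API key needed at the free tier.
url = "https://pt-edge.onrender.com/api/v1/quality/transformers/huggingface/tokenizers"
with urllib.request.urlopen(url) as resp:
    data = json.load(resp)

print(data)
```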
Related models
megagonlabs/ginza-transformers
Use custom tokenizers in spacy-transformers
Kaleidophon/token2index
A lightweight but powerful library to build token indices for NLP tasks, compatible with major...
Hugging-Face-Supporter/tftokenizers
Use Huggingface Transformer and Tokenizers as Tensorflow Reusable SavedModels
NVIDIA/Cosmos-Tokenizer
A suite of image and video neural tokenizers
wangcongcong123/ttt
A package for fine-tuning Transformers with TPUs, written in Tensorflow2.0+