tokenizers and language-tokenizer

These libraries target the same problem: Hugging Face's tokenizers is a production-grade, widely adopted implementation of state-of-the-art tokenization for text in many languages, while language-tokenizer pursues similar goals but shows little adoption or maintenance activity.

tokenizers — overall score 90 (Verified)
    Maintenance 20/25 · Adoption 25/25 · Maturity 25/25 · Community 20/25
    Stars: 10,520 · Forks: 1,051 · Downloads: 129,702,376 · Commits (30d): 33
    Language: Rust · License: Apache-2.0
    Risk flags: none

language-tokenizer — overall score 22 (Experimental)
    Maintenance 13/25 · Adoption 0/25 · Maturity 9/25 · Community 0/25
    Stars: — · Forks: — · Downloads: — · Commits (30d): 0
    Language: Rust · License: WTFPL
    Risk flags: No Package, No Dependents

About tokenizers

huggingface/tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production

Implemented in Rust with Python/Node.js/Ruby bindings, it supports BPE, WordPiece, and Unigram tokenization algorithms with integrated normalization that tracks character-level alignment to original text. The library handles full preprocessing pipelines including truncation, padding, and special token injection, enabling both vocabulary training and inference through a unified modular API.
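A minimal sketch of that pipeline through the Python bindings: training a small BPE vocabulary from an in-memory corpus, then encoding with truncation and padding enabled. The corpus and the `vocab_size`/`max_length` values are made up for illustration.

```python
# Sketch of the tokenizers Python API: train a tiny BPE model, then
# run the preprocessing pipeline (truncation, padding) at encode time.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Build a BPE tokenizer with whitespace pre-tokenization.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Train a vocabulary directly from an iterator (hypothetical mini-corpus).
trainer = BpeTrainer(special_tokens=["[UNK]", "[PAD]"], vocab_size=200)
corpus = ["fast tokenizers", "tokenizers are fast", "train fast tokenizers"]
tokenizer.train_from_iterator(corpus, trainer)

# Preprocessing pipeline: truncate long inputs, pad batches to equal length.
tokenizer.enable_truncation(max_length=8)
tokenizer.enable_padding(pad_id=tokenizer.token_to_id("[PAD]"),
                         pad_token="[PAD]")

enc = tokenizer.encode("tokenizers are fast")
print(enc.tokens)   # token strings
print(enc.ids)      # vocabulary ids
print(enc.offsets)  # (start, end) character spans into the original text
```

The `offsets` field is what the alignment tracking mentioned above refers to: each token carries the character span it came from, even after normalization.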

About language-tokenizer

mazebrr/language-tokenizer

🧩 Tokenize text efficiently across multiple languages using our robust library, combining Unicode and NLP techniques for accurate text analysis.

Scores updated daily from GitHub, PyPI, and npm data.