openscilab/tocount

ToCount: Lightweight Token Estimator

Emerging

Provides multiple token-count estimation strategies: rule-based approaches and linear regression models trained on tiktoken encodings (R50K, CL100K, O200K), plus newer model tokenizers such as DeepSeek R1 and Llama 3.1. All estimators are exposed through a unified `TextEstimator` interface. The pre-trained models are benchmarked against real-world chat datasets, with R² ranging from 0.62 to 0.97 depending on the model selected and its language specificity. Intended for token budgeting in LLM applications, and includes language-specific variants optimized for English text.
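
The two estimation styles mentioned above can be illustrated with a minimal, self-contained sketch. This is not ToCount's implementation or API: the names `rule_based_estimate`, `LinearEstimator`, `fit_linear`, and `r_squared` are hypothetical, the four-characters-per-token ratio is only a common rule of thumb, the single character-count feature is an assumption, and the sample data is made up for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


def rule_based_estimate(text: str, chars_per_token: float = 4.0) -> int:
    """Rule of thumb: roughly one token per `chars_per_token` characters."""
    if not text:
        return 0
    return max(1, round(len(text) / chars_per_token))


@dataclass
class LinearEstimator:
    """Hypothetical model: tokens ~= slope * character_count + intercept."""
    slope: float
    intercept: float

    def estimate(self, text: str) -> int:
        # Clamp at zero so a negative intercept never yields a negative count.
        return max(0, round(self.slope * len(text) + self.intercept))


def fit_linear(char_counts: Sequence[float],
               token_counts: Sequence[float]) -> LinearEstimator:
    """Ordinary least squares with one feature (character count)."""
    n = len(char_counts)
    if n < 2 or n != len(token_counts):
        raise ValueError("need at least two paired observations")
    mean_x = sum(char_counts) / n
    mean_y = sum(token_counts) / n
    sxx = sum((x - mean_x) ** 2 for x in char_counts)
    if sxx == 0:
        raise ValueError("character counts must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(char_counts, token_counts))
    slope = sxy / sxx
    return LinearEstimator(slope=slope, intercept=mean_y - slope * mean_x)


def r_squared(model: LinearEstimator,
              char_counts: Sequence[float],
              token_counts: Sequence[float]) -> float:
    """Coefficient of determination of the fitted line on the given data."""
    mean_y = sum(token_counts) / len(token_counts)
    ss_tot = sum((y - mean_y) ** 2 for y in token_counts)
    if ss_tot == 0:
        raise ValueError("token counts must not all be equal")
    ss_res = sum((y - (model.slope * x + model.intercept)) ** 2
                 for x, y in zip(char_counts, token_counts))
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Made-up (character count, token count) pairs; in practice the token
    # counts would come from running a real tokenizer over a text corpus.
    chars = [10, 20, 30, 40]
    tokens = [3, 5, 7, 9]
    model = fit_linear(chars, tokens)
    print(model, r_squared(model, chars, tokens))
    print(model.estimate("How many tokens is this sentence?"))
```

A fitted line like this needs only the text length at inference time, which is what makes this kind of estimator lightweight compared with running the tokenizer itself; the cost is the accuracy gap that an R² below 1 reflects.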

Available on PyPI.

No Dependents

Maintenance 10 / 25

Adoption 10 / 25

Maturity 18 / 25

Community 4 / 25

Language

Python

License

MIT

Higher-rated alternatives

guillaume-be/rust-tokenizers

Rust-tokenizer offers high-performance tokenizers for modern language models, including...

sugarme/tokenizer

NLP tokenizers written in Go

elixir-nx/tokenizers

Elixir bindings for 🤗 Tokenizers

reinfer/blingfire-rs

Rust wrapper for the BlingFire tokenization library

frothywater/kanade-tokenizer

Kanade is a single-layer disentangled speech tokenizer that extracts compact tokens suitable for...
