openscilab/tocount
ToCount: Lightweight Token Estimator
Provides multiple token-count estimation strategies, including rule-based approaches and linear regression models fitted to tiktoken encodings (R50K, CL100K, O200K) as well as newer tokenizers such as DeepSeek R1 and Llama 3.1. A unified `TextEstimator` interface exposes pre-trained models benchmarked against real-world chat datasets, with R² ranging from 0.62 to 0.97 depending on the model selected and on whether it is language-specific. Targets token budgeting for LLM applications, with language-specific variants optimized for English text.
Available on PyPI.
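To make the two strategy families above concrete, here is a minimal, self-contained sketch of a rule-based estimator and a linear-regression-style estimator. It does not use or reproduce ToCount's own code: the function names, the regular expression, and the `SLOPE` and `INTERCEPT` values are all hypothetical and chosen purely for illustration.

```python
import re

# Hypothetical coefficients for illustration only; these are NOT the
# values shipped with ToCount's pre-trained models.
SLOPE = 0.25      # assumed tokens per character
INTERCEPT = 1.0   # assumed constant offset


def linear_estimate(text: str, slope: float = SLOPE, intercept: float = INTERCEPT) -> int:
    """Estimate the token count as a linear function of character count."""
    if not text:
        return 0
    return max(1, round(slope * len(text) + intercept))


def rule_based_estimate(text: str) -> int:
    """Estimate the token count by counting word runs and punctuation marks."""
    return len(re.findall(r"\w+|[^\w\s]", text))


def fits_budget(text: str, budget: int) -> bool:
    """Return True if the linear estimate does not exceed the token budget."""
    return linear_estimate(text) <= budget


if __name__ == "__main__":
    prompt = "How are you?"
    print(rule_based_estimate(prompt))
    print(linear_estimate(prompt))
    print(fits_budget(prompt, budget=8))
```

An estimator of this kind trades accuracy for speed: it never loads a tokenizer vocabulary, so it suits quick budget checks before a prompt is sent, while an exact count still requires the real tokenizer.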
Stars: 21
Forks: 1
Language: Python
License: MIT
Category:
Last pushed: Feb 14, 2026
Monthly downloads: 47
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/openscilab/tocount"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
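The same request can be made from Python with only the standard library. The sketch below is hedged: the helper names are hypothetical, the category/owner/repository path layout is inferred from the single URL shown above, and because the response format is not documented here the body is returned as raw text rather than parsed.

```python
import urllib.request

# Base path taken from the curl example; treating the remaining segments
# as category/owner/repo is an assumption based on that one URL.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL for one repository."""
    return f"{BASE_URL}/{category}/{owner}/{repo}"


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0) -> str:
    """Fetch the endpoint and return the response body as text.

    The response schema is not documented in this listing, so no parsing
    is attempted here.
    """
    url = quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_quality("nlp", "openscilab", "tocount")` issues the same GET request as the curl command. Each call counts against the daily request limit, so caching the result locally is sensible for repeated lookups.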
Higher-rated alternatives
guillaume-be/rust-tokenizers
Rust-tokenizer offers high-performance tokenizers for modern language models, including...
sugarme/tokenizer
NLP tokenizers written in Go language
elixir-nx/tokenizers
Elixir bindings for 🤗 Tokenizers
reinfer/blingfire-rs
Rust wrapper for the BlingFire tokenization library
frothywater/kanade-tokenizer
Kanade is a single-layer disentangled speech tokenizer that extracts compact tokens suitable for...