brightjonathan/BPE-Stanford
We implement and train a byte-level byte-pair encoding (BPE) tokenizer. In particular, we represent arbitrary (Unicode) strings as a sequence of bytes and train our BPE tokenizer on this byte sequence. We then use this tokenizer to encode text (a string) into tokens (a sequence of integers).
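The idea can be sketched in a few lines: start from the raw UTF-8 bytes of the training text, repeatedly merge the most frequent adjacent pair of tokens into a new token id, and replay those merges to encode new strings. This is a minimal illustration of byte-level BPE, not the repository's actual implementation; the function names and the toy training corpus are assumptions for the example.

```python
from collections import Counter


def merge(ids, pair, new_id):
    """Replace every occurrence of the adjacent `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn BPE merges over the UTF-8 bytes of `text` (toy sketch)."""
    ids = list(text.encode("utf-8"))  # base vocabulary: byte values 0..255
    merges = {}                       # (token_a, token_b) -> new token id
    next_id = 256
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges[best] = next_id
        ids = merge(ids, best, next_id)
        next_id += 1
    return merges


def encode(text, merges):
    """Encode a string into token ids by replaying learned merges in order."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():  # dicts preserve insertion order
        ids = merge(ids, pair, new_id)
    return ids


merges = train_bpe("low low lower lowest", num_merges=10)
tokens = encode("low lower", merges)
```

Because the base vocabulary is the 256 byte values, any Unicode string can be tokenized with no out-of-vocabulary failures; merges only ever shorten the sequence relative to the raw byte encoding.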
No commits in the last 6 months.
Stars: —
Forks: —
Language: Python
License: MIT
Category:
Last pushed: Aug 07, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/brightjonathan/BPE-Stanford"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
guillaume-be/rust-tokenizers
Rust-tokenizer offers high-performance tokenizers for modern language models, including...
sugarme/tokenizer
NLP tokenizers written in Go language
elixir-nx/tokenizers
Elixir bindings for 🤗 Tokenizers
openscilab/tocount
ToCount: Lightweight Token Estimator
reinfer/blingfire-rs
Rust wrapper for the BlingFire tokenization library