jmaczan/bpe.c
High-performance Byte-Pair Encoding tokenizer for large language models
This is a high-performance tool for individuals or teams training large language models. It takes very large text datasets and efficiently converts them into token sequences using Byte-Pair Encoding, a crucial preprocessing step for preparing vast amounts of text for model training.
No commits in the last 6 months.
Use this if you need a fast and efficient way to preprocess extremely large text corpora for training large language models.
Not ideal if you are looking for a general-purpose text processing tool or if your focus is on smaller datasets for tasks other than large language model training.
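To make the tokenization process concrete: BPE training repeatedly finds the most frequent adjacent token pair in the corpus and merges it into a new token. The sketch below, in C to match the repo's language, is a hypothetical illustration of one such step, not code from jmaczan/bpe.c itself (a real implementation would use hash maps rather than the O(n²) scan shown here).

```c
#include <stddef.h>

/* Find the most frequent adjacent pair in tokens[0..n-1].
   Writes the pair into out[2] and returns its occurrence count.
   O(n^2) for brevity; production tokenizers count pairs with a hash map. */
static size_t most_frequent_pair(const int *tokens, size_t n, int out[2]) {
    size_t best = 0;
    for (size_t i = 0; i + 1 < n; i++) {
        size_t count = 0;
        for (size_t j = 0; j + 1 < n; j++)
            if (tokens[j] == tokens[i] && tokens[j + 1] == tokens[i + 1])
                count++;
        if (count > best) {
            best = count;
            out[0] = tokens[i];
            out[1] = tokens[i + 1];
        }
    }
    return best;
}

/* Replace every occurrence of the pair (a, b) with new_id, in place.
   Returns the new sequence length. */
static size_t merge_pair(int *tokens, size_t n, int a, int b, int new_id) {
    size_t w = 0;
    for (size_t r = 0; r < n; ) {
        if (r + 1 < n && tokens[r] == a && tokens[r + 1] == b) {
            tokens[w++] = new_id;  /* merged pair becomes one new token */
            r += 2;
        } else {
            tokens[w++] = tokens[r++];
        }
    }
    return w;
}
```

Training repeats these two steps until the vocabulary reaches its target size; encoding then replays the learned merges in order on new text.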
Stars
3
Forks
—
Language
C
License
GPL-3.0
Category
Last pushed
Jun 23, 2024
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/jmaczan/bpe.c"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
georg-jung/FastBertTokenizer
Fast and memory-efficient library for WordPiece tokenization as it is used by BERT.
ml-rust/splintr
A high-performance tokenizer (BPE + SentencePiece) built with Rust with Python bindings, focused...
sefineh-ai/Amharic-Tokenizer
Syllable-aware BPE tokenizer for the Amharic language (አማርኛ) – fast, accurate, trainable.
sanderland/script_tok
Code for the paper "BPE stays on SCRIPT"
ash-01xor/bpe.c
Simple Byte-Pair Encoding mechanism used for tokenization, written purely in C.