jmaczan/bpe.c
High-performance Byte-Pair Encoding tokenizer for large language models
This is a high-performance tool for individuals or teams training large language models. It takes very large text datasets and efficiently converts them into token sequences using Byte-Pair Encoding, a crucial preprocessing step for preparing vast amounts of text for model training.
No commits in the last 6 months.
Use this if you need a fast and efficient way to preprocess extremely large text corpora for training large language models.
Not ideal if you are looking for a general-purpose text processing tool or if your focus is on smaller datasets for tasks other than large language model training.
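To make the tokenization process concrete: BPE training repeatedly finds the most frequent adjacent token pair in the corpus and merges it into a new token. The sketch below, in C to match the repo's language, is a hypothetical illustration of one such step, not code from jmaczan/bpe.c itself (a real implementation would use hash maps rather than the O(n²) scan shown here).

```c
#include <stddef.h>

/* Find the most frequent adjacent pair in tokens[0..n-1].
   Writes the pair into out[2] and returns its occurrence count.
   O(n^2) for brevity; production tokenizers count pairs with a hash map. */
static size_t most_frequent_pair(const int *tokens, size_t n, int out[2]) {
    size_t best = 0;
    for (size_t i = 0; i + 1 < n; i++) {
        size_t count = 0;
        for (size_t j = 0; j + 1 < n; j++)
            if (tokens[j] == tokens[i] && tokens[j + 1] == tokens[i + 1])
                count++;
        if (count > best) {
            best = count;
            out[0] = tokens[i];
            out[1] = tokens[i + 1];
        }
    }
    return best;
}

/* Replace every occurrence of the pair (a, b) with new_id, in place.
   Returns the new sequence length. */
static size_t merge_pair(int *tokens, size_t n, int a, int b, int new_id) {
    size_t w = 0;
    for (size_t r = 0; r < n; ) {
        if (r + 1 < n && tokens[r] == a && tokens[r + 1] == b) {
            tokens[w++] = new_id;  /* merged pair becomes one new token */
            r += 2;
        } else {
            tokens[w++] = tokens[r++];
        }
    }
    return w;
}
```

Training repeats these two steps until the vocabulary reaches its target size; encoding then replays the learned merges in order on new text.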
Stars
3
Forks
—
Language
C
License
GPL-3.0
Category
Last pushed
Jun 23, 2024
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/jmaczan/bpe.c"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
georg-jung/FastBertTokenizer
Fast and memory-efficient library for WordPiece tokenization as it is used by BERT.
ml-rust/splintr
A high-performance tokenizer (BPE + SentencePiece) built with Rust with Python bindings, focused...
sefineh-ai/Amharic-Tokenizer
Syllable-aware BPE tokenizer for the Amharic language (አማርኛ) – fast, accurate, trainable.
sanderland/script_tok
Code for the paper "BPE stays on SCRIPT"
ash-01xor/bpe.c
Simple Byte-Pair Encoding mechanism used for tokenization, written purely in C.