DolbyUUU/byte_pair_encoding_BPE_subword_tokenization_implementation_python
Byte-Pair Encoding (BPE) (subword-based tokenization) algorithm implementaions from scratch with python
This project breaks down raw text into smaller, meaningful pieces (subwords) that are better understood by machine learning models. You provide a large collection of text (a corpus) as input, and it outputs a tokenizer that can split any new text into these learned subword units. It is ideal for data scientists or NLP engineers building language models or text analysis tools.
No commits in the last 6 months.
Use this if you need to prepare text data for natural language processing tasks, especially when dealing with large vocabularies or out-of-vocabulary words.
Not ideal if you're looking for a pre-trained, production-ready tokenizer or a solution for non-text data.
Stars
18
Forks
—
Language
Python
License
—
Category
Last pushed
Jan 30, 2023
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/DolbyUUU/byte_pair_encoding_BPE_subword_tokenization_implementation_python"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
google/sentencepiece
Unsupervised text tokenizer for Neural Network-based text generation.
daac-tools/vibrato
🎤 vibrato: Viterbi-based accelerated tokenizer
OpenNMT/Tokenizer
Fast and customizable text tokenization library with BPE and SentencePiece support
Systemcluster/kitoken
Fast and versatile tokenizer for language models, compatible with SentencePiece, Tokenizers,...
soaxelbrooke/python-bpe
Byte Pair Encoding for Python!