hplt-project/sacremoses
Python port of Moses tokenizer, truecaser and normalizer
Provides language-specific tokenization with XML escaping, detokenization, and punctuation normalization through both a Python API and a CLI. The truecaser learns capitalization patterns from training data and supports ASR-specific modes, while the normalizer handles Unicode punctuation and removes control characters. It supports parallel processing across multiple languages and can chain operations into pipelines for end-to-end text preprocessing.
495 stars and 2,424,232 monthly downloads. Used by 31 other packages. Available on PyPI.
Stars: 495
Forks: 60
Language: Python
License: MIT
Category:
Last pushed: Feb 06, 2026
Monthly downloads: 2,424,232
Commits (30d): 0
Dependencies: 4
Reverse dependents: 31
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/hplt-project/sacremoses"
Open to everyone: 100 requests/day, no key needed. A free key raises the limit to 1,000/day.
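The same request can be made from Python with the standard library. This is a minimal sketch: the listing does not document the response format, so parsing the body as JSON is an assumption, as is treating the `nlp` path segment as a category. The function names `quality_url` and `fetch_quality` are hypothetical.

```python
import json
import urllib.request
from urllib.parse import quote

BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(owner, repo, category="nlp"):
    """Build the endpoint URL shown in the curl example above."""
    # Assumption: the segment before the owner is a category name.
    parts = [quote(part, safe="") for part in (category, owner, repo)]
    return BASE_URL + "/" + "/".join(parts)


def fetch_quality(owner, repo, category="nlp", timeout=10):
    """Fetch the record; assumes (unverified) that the body is JSON."""
    url = quality_url(owner, repo, category)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example call (performs a network request, so it is left commented out):
# data = fetch_quality("hplt-project", "sacremoses")
```

Unauthenticated use is limited to 100 requests per day, so caching the result locally is worth considering for repeated lookups.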
Related tools
adbar/simplemma
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
sorenlind/lemmy
🤘Lemmy is a lemmatizer for Danish 🇩🇰 and Swedish 🇸🇪
winkjs/wink-lemmatizer
English lemmatizer
Blake-Madden/OleanderStemmingLibrary
Porter stemming library (C++)
htaghizadeh/PersianStemmer-Python
PersianStemmer-Python