hplt-project/sacremoses

Python port of the Moses tokenizer, truecaser, and normalizer

Score: 78 / 100 (Verified)

Provides language-specific tokenization with XML escaping, detokenization, and punctuation normalization through both a Python API and a command-line interface. The truecaser learns capitalization patterns from training data and supports ASR-specific modes; the normalizer handles Unicode punctuation and removes control characters. The package supports parallel processing, works across multiple languages, and can chain operations into a pipeline for end-to-end text preprocessing.

495 stars and 2,424,232 monthly downloads. Used by 31 other packages. Available on PyPI.

Maintenance 10 / 25
Adoption 25 / 25
Maturity 25 / 25
Community 18 / 25


Stars: 495
Forks: 60
Language: Python
License: MIT
Last pushed: Feb 06, 2026
Monthly downloads: 2,424,232
Commits (30d): 0
Dependencies: 4
Reverse dependents: 31

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/nlp/hplt-project/sacremoses"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
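The same request can be made from Python using only the standard library. This is a minimal sketch under two assumptions that the page does not state: the endpoint returns JSON, and the path follows a category/owner/repo pattern, which is inferred from the single URL shown above. The function names `quality_url` and `fetch_quality` are hypothetical.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path taken from the curl example; everything after it is
# assumed to follow a <category>/<owner>/<repo> pattern.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category, owner, repo):
    """Build the endpoint URL for one repository (hypothetical helper)."""
    parts = (quote(part, safe="") for part in (category, owner, repo))
    return BASE_URL + "/" + "/".join(parts)


def fetch_quality(category, owner, repo, timeout=10.0):
    """Fetch and decode the response, assuming the body is JSON."""
    url = quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Makes one network request, which counts against the daily limit.
    print(fetch_quality("nlp", "hplt-project", "sacremoses"))
```

Because unauthenticated use is limited to 100 requests/day, it is worth caching the result locally rather than calling `fetch_quality` repeatedly.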