scarletcho/KoLM

Korean text normalization and language preparation package for language models (LM) in Kaldi-based ASR systems

Quality score: 55 / 100 (Established)

Provides morphological analysis via KoNLPy/Mecab integration and generates two granularity levels of pseudo-morphemes (micro and medium units) for flexible tokenization in language model training. The pipeline chains text normalization, character transcription (numbers, hanja, hangul jamos, alphabets), morphological tagging, and grapheme-to-phoneme conversion to produce Kaldi-compatible lexicon and corpus files, with explicit support for UTagger morphological analyzer output format.
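The chaining of normalization and number transcription described above can be illustrated with a minimal sketch. This is not KoLM's API: the names `normalize`, `read_number`, `transcribe_numbers`, and `run_pipeline` are hypothetical, and the sketch assumes Sino-Korean readings for numbers from 0 to 9999 only. It ignores native-Korean numerals before counters, irregular month names, and the later stages (morphological tagging and grapheme-to-phoneme conversion).

```python
import re

# Sino-Korean digit readings; index 0 is never read aloud inside a number.
DIGITS = ["", "일", "이", "삼", "사", "오", "육", "칠", "팔", "구"]
UNITS = ["", "십", "백", "천"]  # ones, tens, hundreds, thousands


def read_number(n: int) -> str:
    """Spell out 0..9999 in Sino-Korean (hypothetical helper, not KoLM's API)."""
    if not 0 <= n <= 9999:
        raise ValueError("this sketch only covers 0..9999")
    if n == 0:
        return "영"
    digits = str(n)
    parts = []
    for pos, ch in enumerate(digits):
        d = int(ch)
        if d == 0:
            continue  # zero digits are silent
        unit = UNITS[len(digits) - 1 - pos]
        # A 1 is silent before 십/백/천 (10 is read "십", not "일십").
        parts.append(("" if d == 1 and unit else DIGITS[d]) + unit)
    return "".join(parts)


def normalize(text: str) -> str:
    """Keep Hangul syllables, digits, Latin letters, and single spaces."""
    text = re.sub(r"[^0-9A-Za-z가-힣\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def transcribe_numbers(text: str) -> str:
    """Replace digit runs of up to four digits; longer runs are left as-is."""
    return re.sub(
        r"(?<!\d)\d{1,4}(?!\d)",
        lambda m: read_number(int(m.group())),
        text,
    )


# Stages run in order; each takes and returns a string.
STAGES = [normalize, transcribe_numbers]


def run_pipeline(text: str) -> str:
    for stage in STAGES:
        text = stage(text)
    return text
```

Keeping each stage as a plain string-to-string function makes it easy to add or reorder steps, for example inserting a tagging stage before corpus files are written.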

No commits in the last 6 months. Available on PyPI.

Stale 6m · No Dependents
Maintenance 0 / 25
Adoption 11 / 25
Maturity 25 / 25
Community 19 / 25


Stars: 63
Forks: 21
Language: Python
License: GPL-3.0
Last pushed: Apr 23, 2020
Monthly downloads: 12
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/voice-ai/scarletcho/KoLM"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
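The same request can be made from Python with only the standard library. The sketch below rests on two assumptions: that the endpoint returns a JSON body (the page does not show the response format), and that the path segments follow a category/owner/repo pattern as in the URL above. The function names `quality_url` and `fetch_quality` are illustrative.

```python
import json
import urllib.request
from urllib.parse import quote

BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL, percent-encoding each path segment."""
    segments = (quote(part, safe="") for part in (category, owner, repo))
    return BASE_URL + "/" + "/".join(segments)


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    """Fetch the quality record; assumes the response body is JSON."""
    url = quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `fetch_quality("voice-ai", "scarletcho", "KoLM")` issues the same GET request as the curl command. Unkeyed access is limited to 100 requests/day, so repeated lookups are worth caching.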