michmech/lemmatization-lists

Machine-readable lists of lemma-token pairs in 23 languages.

50 / 100 (Established)

Provides tab-separated lemma-token pairs drawn from Hunspell dictionaries, morphological lexicons (FreeLing, SALDO, Multext East), and language-specific databases, intended for query expansion in full-text search engines. Data spans 25 languages, with coverage ranging from ~6K pairs (Persian) to 3.3M pairs (Polish). Files are UTF-8 plain text, suitable for integration into NLP pipelines and search infrastructure. Licensed under ODbL; the data aggregates legally sourced resources from academic morphology projects and community linguistic databases.
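
As a rough illustration of the query-expansion use case, the sketch below loads lemma-token pairs and expands a search term to its lemma and sibling forms. It assumes each line has the shape lemma, TAB, token, and that a file may begin with a byte-order mark; both are assumptions to check against the actual files. The names `load_pairs` and `expand_query` and the sample data are hypothetical, not part of the repository.

```python
import io
from collections import defaultdict


def load_pairs(lines):
    """Parse 'lemma<TAB>token' lines into two lookup tables.

    Assumption: lemma comes first, token second, one pair per line.
    """
    token_to_lemmas = defaultdict(set)
    lemma_to_tokens = defaultdict(set)
    for raw in lines:
        # Drop a possible leading BOM and the trailing newline.
        line = raw.lstrip("\ufeff").rstrip("\r\n")
        if not line or "\t" not in line:
            continue  # skip blank or malformed lines
        lemma, token = line.split("\t", 1)
        token_to_lemmas[token.lower()].add(lemma.lower())
        lemma_to_tokens[lemma.lower()].add(token.lower())
    return token_to_lemmas, lemma_to_tokens


def expand_query(term, token_to_lemmas, lemma_to_tokens):
    """Return the term, its lemma(s), and every known form of those lemmas."""
    term = term.lower()
    lemmas = set(token_to_lemmas.get(term, ()))
    if term in lemma_to_tokens:
        lemmas.add(term)  # the term is itself a lemma
    expanded = {term}
    for lemma in lemmas:
        expanded.add(lemma)
        expanded |= lemma_to_tokens[lemma]
    return sorted(expanded)


# Hypothetical in-memory sample; a real file would be opened with
# open(path, encoding="utf-8") and passed to load_pairs the same way.
SAMPLE = "walk\twalked\nwalk\twalking\nwalk\twalks\ngo\twent\n"
t2l, l2t = load_pairs(io.StringIO(SAMPLE))
print(expand_query("walked", t2l, l2t))
```

Keeping both directions of the mapping in memory trades space for constant-time lookups; for the largest lists (millions of pairs) an on-disk index may be a better fit.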

361 stars. No commits in the last 6 months.

Stale 6m No Package No Dependents
Maintenance 0 / 25
Adoption 10 / 25
Maturity 16 / 25
Community 24 / 25


Stars: 361
Forks: 98
Language: (not listed)
License: ODbL-1.0
Last pushed: Jan 29, 2022
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/nlp/michmech/lemmatization-lists"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.