michmech/lemmatization-lists

Machine-readable lists of lemma-token pairs in 23 languages.

Established

Provides tab-separated lemma-token pairs sourced from Hunspell dictionaries, morphological lexicons (FreeLing, SALDO, Multext East), and language-specific databases, intended for query expansion in full-text search engines. Coverage ranges from ~6K pairs (Persian) to 3.3M pairs (Polish). The lists are distributed as UTF-8 plain-text files suitable for integration into NLP pipelines and search infrastructure. Licensed under ODbL; the data aggregates legally sourced resources from academic morphology projects and community linguistic databases.
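
As a rough illustration of the query-expansion use case, the sketch below parses tab-separated lines and expands a search term to every form that shares a lemma with it. It assumes each line holds a lemma, a tab, and then a token, in that order; the function names `load_pairs` and `expand_term` are hypothetical and not part of the repository, and the lowercasing via `casefold` is a design choice of this sketch rather than something the lists require.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Index = Dict[str, Set[str]]


def load_pairs(lines: Iterable[str]) -> Tuple[Index, Index]:
    """Build lemma->forms and form->lemmas tables from tab-separated lines.

    Assumption: each line is "<lemma>\t<token>". Lines without a tab are skipped.
    """
    lemma_to_forms: Index = defaultdict(set)
    form_to_lemmas: Index = defaultdict(set)
    for raw in lines:
        # Drop a possible byte-order mark and the line terminator.
        line = raw.lstrip("\ufeff").rstrip("\r\n")
        if "\t" not in line:
            continue
        lemma, form = line.split("\t", 1)
        lemma, form = lemma.strip().casefold(), form.strip().casefold()
        if not lemma or not form:
            continue
        lemma_to_forms[lemma].add(form)
        form_to_lemmas[form].add(lemma)
    return lemma_to_forms, form_to_lemmas


def expand_term(term: str, lemma_to_forms: Index, form_to_lemmas: Index) -> Set[str]:
    """Return the term plus every lemma and inflected form related to it."""
    key = term.casefold()
    # The term may be an inflected form, a lemma itself, or both.
    lemmas = set(form_to_lemmas.get(key, ()))
    if key in lemma_to_forms:
        lemmas.add(key)
    expanded = {key}
    for lemma in lemmas:
        expanded.add(lemma)
        expanded.update(lemma_to_forms[lemma])
    return expanded


# Tiny in-memory sample standing in for one of the list files.
sample = ["go\twent", "go\tgoes", "go\tgoing", "be\twas"]
lemma_to_forms, form_to_lemmas = load_pairs(sample)
print(sorted(expand_term("went", lemma_to_forms, form_to_lemmas)))
```

To use a real list, pass an open file object instead of `sample`, for example one opened with `encoding="utf-8"`. A term that appears in neither table expands to just itself, so unknown words still reach the search engine unchanged.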

361 stars. No commits in the last 6 months.

Stale 6m, No Package, No Dependents

Maintenance 0 / 25

Adoption 10 / 25

Maturity 16 / 25

Community 24 / 25

Stars

361

Forks

Language

—

License

ODbL-1.0

Related tools

hplt-project/sacremoses

Python port of Moses tokenizer, truecaser and normalizer

adbar/simplemma

Simple multilingual lemmatizer for Python, especially useful for speed and efficiency

sorenlind/lemmy

🤘Lemmy is a lemmatizer for Danish 🇩🇰 and Swedish 🇸🇪

winkjs/wink-lemmatizer

English lemmatizer

Blake-Madden/OleanderStemmingLibrary

Porter stemming library (C++)
