michmech/lemmatization-lists
Machine-readable lists of lemma-token pairs in 23 languages.
Provides tab-separated lemma-token pairs sourced from Hunspell dictionaries, morphological lexicons (FreeLing, SALDO, MULTEXT-East), and language-specific databases, intended for query expansion in full-text search engines. Coverage ranges from ~6K pairs (Persian) to 3.3M pairs (Polish). The lists are distributed as UTF-8 plain-text files suitable for integration into NLP pipelines and search infrastructure. The data is licensed under ODbL and aggregates legally sourced resources from academic morphology projects and community linguistic databases.
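As a rough illustration of the query-expansion use case, the Python sketch below loads tab-separated pairs into lookup tables and expands a search term to every form that shares its lemma. It assumes each line holds the lemma first and the token second, separated by a single tab. The sample data and the function names (`load_pairs`, `expand_query`) are hypothetical and are not taken from the repository.

```python
import io
from collections import defaultdict


def load_pairs(lines):
    """Build token->lemma and lemma->tokens maps from tab-separated lines.

    Assumption: each line is "lemma<TAB>token"; the column order is inferred
    from the phrase "lemma-token pairs" and should be checked against the data.
    """
    token_to_lemma = {}
    lemma_to_tokens = defaultdict(set)
    for raw in lines:
        # Drop a possible byte-order mark and the trailing newline.
        line = raw.lstrip("\ufeff").rstrip("\r\n")
        if "\t" not in line:
            continue  # skip blank or malformed lines
        lemma, token = line.split("\t", 1)
        # A token can belong to several lemmas; this sketch keeps the first one seen.
        token_to_lemma.setdefault(token, lemma)
        lemma_to_tokens[lemma].add(token)
    return token_to_lemma, lemma_to_tokens


def expand_query(term, token_to_lemma, lemma_to_tokens):
    """Return the term, its lemma, and all listed forms of that lemma, sorted."""
    lemma = token_to_lemma.get(term, term)
    forms = {term, lemma} | lemma_to_tokens.get(lemma, set())
    return sorted(forms)


# Hypothetical sample data standing in for one of the list files.
SAMPLE = "walk\twalks\nwalk\twalked\nwalk\twalking\ngo\twent\n"
token_to_lemma, lemma_to_tokens = load_pairs(io.StringIO(SAMPLE))
expanded = expand_query("walked", token_to_lemma, lemma_to_tokens)
```

For a real file, the same function could be fed an open file handle, for example `open(path, encoding="utf-8")`, where `path` points at whichever list has been downloaded.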
361 stars. No commits in the last 6 months.
Stars: 361
Forks: 98
Language: n/a
License: ODbL-1.0
Category:
Last pushed: Jan 29, 2022
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/michmech/lemmatization-lists"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
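The same request can be sketched in Python using only the standard library. The URL layout (`/quality/<category>/<owner>/<repo>`) is inferred from the single curl example above, and the assumption that the response body is JSON is unverified. The helper names `build_url` and `fetch_quality` are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Base path taken from the curl example; everything after it is an inferred pattern.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_url(category, repo):
    """Compose the endpoint URL from a category and an "owner/name" repo slug."""
    owner, name = repo.split("/", 1)
    parts = [category, owner, name]
    return BASE_URL + "/" + "/".join(urllib.parse.quote(p, safe="") for p in parts)


def fetch_quality(category, repo, timeout=10):
    """Fetch the record for a repo. Assumes (unverified) that the body is JSON."""
    with urllib.request.urlopen(build_url(category, repo), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


url = build_url("nlp", "michmech/lemmatization-lists")
```

Calling `fetch_quality("nlp", "michmech/lemmatization-lists")` would issue the request; it is left uncalled here because it needs network access and counts against the daily request limit.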
Related tools
hplt-project/sacremoses
Python port of Moses tokenizer, truecaser and normalizer
adbar/simplemma
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
sorenlind/lemmy
🤘Lemmy is a lemmatizer for Danish 🇩🇰 and Swedish 🇸🇪
winkjs/wink-lemmatizer
English lemmatizer
Blake-Madden/OleanderStemmingLibrary
Porter stemming library (C++)