

Machine-readable lists of lemma-token pairs in 23 languages.


Lemmatization Lists

These are large-coverage, machine-readable lemma/token pairs in several languages, which I have collected (legally) from various sources, mostly as part of my work on the Global Glossary project. I use them for query expansion during full-text searches: if a user searches for the lemma walk, the query is expanded to also search for the tokens walking, walked, etc.
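The query expansion described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code used in the Global Glossary project: the function names `build_index` and `expand_query` and the sample pairs are hypothetical.

```python
from collections import defaultdict

def build_index(pairs):
    """Map each lemma to the set of tokens listed for it."""
    index = defaultdict(set)
    for lemma, token in pairs:
        index[lemma].add(token)
    return index

def expand_query(term, index):
    """Return the search term plus every token listed under it as a lemma."""
    return {term} | index.get(term, set())

# Hypothetical sample pairs, in (lemma, token) order.
sample_pairs = [("walk", "walking"), ("walk", "walked"), ("walk", "walks")]
index = build_index(sample_pairs)
print(sorted(expand_query("walk", index)))
```

A term that is not listed as a lemma simply expands to itself, so the search still runs on the original word.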

These are plain text files (zipped). Each line contains one lemma/token pair separated by a tab character in this sequence: lemma, tab, token. The files are encoded in UTF-8 with Windows-style line breaks.
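Reading one of these files might look like the sketch below, assuming the archive has already been unzipped. The function names `read_pairs` and `load_pairs` are hypothetical, and the `utf-8-sig` codec is an assumption chosen so that a leading byte-order mark, if one is present, is dropped; plain `utf-8` matches the description above.

```python
import io

def read_pairs(stream):
    """Yield (lemma, token) tuples from a stream of lemma<TAB>token lines."""
    for line in stream:
        line = line.rstrip("\r\n")  # drop Windows-style (or Unix) line endings
        if not line:
            continue  # ignore blank lines
        lemma, sep, token = line.partition("\t")
        if not sep:
            continue  # ignore lines without a tab separator
        yield lemma, token

def load_pairs(path):
    """Read every pair from an unzipped list file (path is caller-supplied)."""
    # Assumption: "utf-8-sig" tolerates an optional byte-order mark.
    with open(path, encoding="utf-8-sig") as handle:
        return list(read_pairs(handle))

# Small in-memory sample standing in for a real file.
sample = io.StringIO("walk\twalking\r\nwalk\twalked\r\n")
pairs = list(read_pairs(sample))
print(pairs)
```

Because `read_pairs` is a generator, the larger lists can also be processed line by line instead of being held in memory all at once.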

  • Asturian (ast) (108,792 pairs)
  • Bulgarian (bg) (30,323 pairs)
  • Catalan (ca) (591,534 pairs)
  • Czech (cs) (36,400 pairs)
  • English (en) (41,760 pairs)
  • Estonian (et) (80,536 pairs)
  • French (fr) (224,002 pairs)
  • Galician (gl) (392,856 pairs)
  • German (de) (358,473 pairs)
  • Hungarian (hu) (39,898 pairs)
  • Irish (ga) (415,502 pairs)
  • Manx Gaelic (gv) (67,177 pairs)
  • Italian (it) (341,074 pairs)
  • Persian/Farsi (fa) (6,273 pairs)
  • Polish (pl) (3,296,232 pairs)
  • Portuguese (pt) (850,264 pairs)
  • Romanian (ro) (314,810 pairs)
  • Russian (ru) (537,810 pairs)
  • Scottish Gaelic (gd) (51,624 pairs)
  • Slovak (sk) (858,414 pairs)
  • Slovene (sl) (99,063 pairs)
  • Spanish (es) (497,560 pairs)
  • Swedish (sv) (675,137 pairs)
  • Ukrainian (uk) (193,703 pairs)
  • Welsh (cy) (359,224 pairs)


