A curated list of resources dedicated to Natural Language Processing (NLP) in Polish: models, tools, and datasets.
OPUS - the open parallel corpus; you can select a language pair and download only the Polish files
Polish Parliamentary Corpus - texts from the proceedings of the Polish Parliament (Sejm and Senate)
Universal Sentence Encoder Multilingual - sentence embeddings covering 16 languages (including Polish)
BPEmb: Subword Embeddings - includes Polish; easy to use with Flair
Stanza (Python) - a natural language analysis package from Stanford University. It provides tools for sentence and word tokenization, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity recognition. Includes a Polish model
A curated list of Polish abbreviations for the NLTK sentence tokenizer, based on Wikipedia text
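An abbreviation list like the one above is typically plugged into NLTK's Punkt tokenizer so that periods after abbreviations such as "prof." are not treated as sentence boundaries. The following is a minimal sketch, not the list author's own code: the helper names `normalize_abbreviations` and `build_polish_tokenizer` are hypothetical, and it assumes the abbreviations are available as plain strings (the actual file format of the list is not specified here). Punkt stores abbreviations lowercased and without the trailing period, so the entries are normalized first.

```python
def normalize_abbreviations(entries):
    """Return entries lowercased and without trailing periods.

    Punkt compares tokens against abbreviations in this form.
    The input format (an iterable of strings) is an assumption.
    """
    cleaned = set()
    for entry in entries:
        abbr = entry.strip().lower().rstrip(".")
        if abbr:  # skip blank lines and stray periods
            cleaned.add(abbr)
    return cleaned


def build_polish_tokenizer(abbreviations):
    """Build a Punkt sentence tokenizer that knows the given abbreviations."""
    # Imported here so normalize_abbreviations works without NLTK installed.
    from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

    params = PunktParameters()
    params.abbrev_types = normalize_abbreviations(abbreviations)
    # Passing a PunktParameters object skips training and uses it directly.
    return PunktSentenceTokenizer(params)
```

With NLTK installed, `build_polish_tokenizer(["prof.", "dr", "np."]).tokenize(text)` returns a list of sentence strings. No pretrained Punkt model needs to be downloaded for this, because the tokenizer is configured only from the supplied abbreviations.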
If you have or know of valuable materials (datasets, models, posts, articles) that are missing here, please feel free to edit and submit a pull request. You can also send me a note on LinkedIn or via email: [email protected]