About the developer

niderhoff
4.6K Stars 847 Forks 48 Commits 10 Opened issues

Description

Alphabetical list of free/public domain datasets with text data for use in Natural Language Processing (NLP)

nlp-datasets

Alphabetical list of free/public domain datasets with text data for use in Natural Language Processing (NLP). Most of the data here is raw, unstructured text; if you are looking for annotated corpora or treebanks, refer to the sources at the bottom.

Datasets (English, multilang)

Sources

Datasets (Arabic)

  • SaudiNewsNet: 31,030 Arabic newspaper articles along with metadata, extracted from various online Saudi newspapers. (2 MB)

Datasets (Urdu)

Datasets (German)

  • German Political Speeches Corpus: collection of recent speeches given by top German representatives (25 MB, 11 MTokens)

  • NEGRA: A Syntactically Annotated Corpus of German Newspaper Texts. Available free of charge to universities and non-profit organizations; a signed form must be sent in to obtain it. (on request)

  • Ten Thousand German News Articles Dataset: 10,273 German-language news articles categorized into nine classes for topic classification. (26.1 MB)

  • 100k German Court Decisions: Open Legal Data releases a dataset of 100,000 German court decisions and 444,000 citations (772 MB)
