nlp-datasets

by niderhoff





Alphabetical list of free/public domain datasets with text data for use in Natural Language Processing (NLP). Most entries here are raw, unstructured text data; if you are looking for annotated corpora or treebanks, refer to the sources at the bottom.

Datasets (English, multilang)

Sources

Datasets (Arabic)

  • SaudiNewsNet: 31,030 Arabic newspaper articles along with metadata, extracted from various online Saudi newspapers. (2 MB)

Datasets (Urdu)

Datasets (German)

  • German Political Speeches Corpus: collection of recent speeches given by top German representatives (25 MB, 11 MTokens)

  • NEGRA: A Syntactically Annotated Corpus of German Newspaper Texts. Available for free to all universities and non-profit organizations. A signed form must be sent in to obtain it. (on request)

  • Ten Thousand German News Articles Dataset: 10,273 German-language news articles categorized into nine classes for topic classification. (26.1 MB)

  • 100k German Court Decisions: Open Legal Data releases a dataset of 100,000 German court decisions and 444,000 citations (772 MB)
