About the developer

niderhoff
4.8K Stars 887 Forks 63 Commits 5 Opened issues

Description

Alphabetical list of free/public domain datasets with text data for use in Natural Language Processing (NLP)

nlp-datasets

Alphabetical list of free/public domain datasets with text data for use in Natural Language Processing (NLP). Most of the material here is raw, unstructured text data; if you are looking for annotated corpora or treebanks, refer to the sources at the bottom.

Datasets (English, multilang)

Sources

Datasets (Albanian)

  • Albanian News Articles Dataset: Over 3 million Albanian news articles along with metadata, extracted from various Albanian news sources (see list in link).

Datasets (Arabic)

  • SaudiNewsNet: 31,030 Arabic newspaper articles along with metadata, extracted from various online Saudi newspapers. (2 MB)

Datasets (Urdu)

Datasets (German)

  • German Political Speeches Corpus: collection of recent speeches given by top German representatives (25 MB, 11 MTokens)

  • NEGRA: A syntactically annotated corpus of German newspaper texts. Available free of charge to universities and non-profit organizations; a signed form must be sent in to obtain it. (on request)

  • Ten Thousand German News Articles Dataset: 10,273 German-language news articles categorized into nine classes for topic classification. (26.1 MB)

  • 100k German Court Decisions: Open Legal Data releases a dataset of 100,000 German court decisions and 444,000 citations (772 MB)

Datasets (Kinyarwanda and Kirundi)

  • KINNEWS and KIRNEWS: Two annotated and cleaned datasets of more than 20k Kinyarwanda and 4k Kirundi news articles. (65 MB)
