About the developer

natasha
148 Stars 11 Forks MIT License 161 Commits 55 Opened issues

Description

Links to Russian corpora + Python functions for loading and parsing


Links to publicly available Russian corpora + code for loading and parsing. 20+ datasets, 350Gb+ of text.

Usage

For example, let's use the dump of lenta.ru by @yutkin. Download the archive manually (link in the Reference section):

$ wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz

Use corus to load the data:

>>> from corus import load_lenta
>>> path = 'lenta-ru-news.csv.gz'
>>> records = load_lenta(path)
>>> next(records)
LentaRecord(
    url='https://lenta.ru/news/2018/12/14/cancer/',
    title='Названы регионы России с\xa0самой высокой смертностью от\xa0рака',
    text='Вице-премьер по социальным вопросам Татьяна Голикова рассказала, в каких регионах России зафиксирована наиболее высокая смертность от рака, сооб...',
    topic='Россия',
    tags='Общество'
)

Iterate over texts:

>>> records = load_lenta(path)
>>> for record in records:
...     text = record.text
...     ...
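
Since the loader hands back records one at a time, a dump can be sampled without reading it in full. The sketch below counts records per topic over the first few records. The helper name topic_counts and the FakeRecord stand-in are hypothetical and exist only so the example runs offline; the field names mirror the LentaRecord shown above.

```python
from collections import Counter, namedtuple
from itertools import islice


def topic_counts(records, limit=None):
    """Count records per topic, consuming at most `limit` records."""
    # islice stops early, so a large dump is never read to the end
    return Counter(record.topic for record in islice(records, limit))


# Hypothetical stand-in with the same fields as LentaRecord,
# used here so the example does not need the real archive
FakeRecord = namedtuple('FakeRecord', ['url', 'title', 'text', 'topic', 'tags'])

sample = iter([
    FakeRecord('u1', 't1', 'text one', 'Россия', 'Общество'),
    FakeRecord('u2', 't2', 'text two', 'Мир', 'Политика'),
    FakeRecord('u3', 't3', 'text three', 'Россия', 'Общество'),
])
counts = topic_counts(sample, limit=2)
print(counts.most_common())
```

With the real archive, the same helper would be called as topic_counts(load_lenta(path), limit=10000).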

For links to other datasets and their loaders see the Reference section.

Documentation

Materials are in Russian:

Install

corus supports Python 3.5+, PyPy 3.
$ pip install corus

Reference

Each entry lists, in order: dataset, loader (API, from corus import <loader>), tags, number of texts, uncompressed size, description and download commands.
Lenta.ru
Lenta.ru v1.0 load_lenta # news 739 351 1.66 Gb wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz
Lenta.ru v1.1+ load_lenta2 # news 800 975 1.94 Gb wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.1/lenta-ru-news.csv.bz2
Lib.rus.ec load_librusec # fiction 301 871 144.92 Gb Dump of lib.rus.ec prepared for RUSSE workshop

wget http://panchenko.me/data/russe/librusec_fb2.plain.gz

Rossiya Segodnya load_ria_raw # load_ria # news 1 003 869 3.70 Gb wget https://github.com/RossiyaSegodnya/ria_news_dataset/raw/master/ria.json.gz
Mokoron Russian Twitter Corpus load_mokoron # social sentiment 17 633 417 1.86 Gb Russian Twitter sentiment markup

Manually download https://www.dropbox.com/s/9egqjszeicki4ho/db.sql

Wikipedia load_wiki # 1 541 401 12.94 Gb Russian Wiki dump

wget https://dumps.wikimedia.org/ruwiki/latest/ruwiki-latest-pages-articles.xml.bz2

GramEval2020 load_gramru # 162 372 30.04 Mb wget https://github.com/dialogue-evaluation/GramEval2020/archive/master.zip

unzip master.zip

mv GramEval2020-master/dataTrain train

mv GramEval2020-master/dataOpenTest dev

rm -r master.zip GramEval2020-master

wget https://github.com/AlexeySorokin/GramEval2020/raw/master/data/GramEval_private_test.conllu

OpenCorpora load_corpora # morph 4 030 20.21 Mb wget http://opencorpora.org/files/export/annot/annot.opcorpora.xml.zip
RusVectores SimLex-965 load_simlex # emb sim wget https://rusvectores.org/static/testsets/ru_simlex965_tagged.tsv

wget https://rusvectores.org/static/testsets/ru_simlex965.tsv

Omnia Russica load_omnia # morph web fiction 489.62 Gb Taiga + Wiki + Araneum. Read "Even larger Russian corpus" https://events.spbu.ru/eventsContent/events/2019/corpora/corp_sborn.pdf

Manually download http://bit.ly/2ZT4BY9

factRuEval-2016 load_factru # ner news 254 969.27 Kb Manual PER, LOC, ORG markup prepared for 2016 Dialog competition

wget https://github.com/dialogue-evaluation/factRuEval-2016/archive/master.zip

unzip master.zip

rm master.zip

Gareev load_gareev # ner news 97 455.02 Kb Manual PER, ORG markup (no LOC)

Email Rinat Gareev ([email protected]) and ask for the dataset

tar -xvf rus-ner-news-corpus.iob.tar.gz

rm rus-ner-news-corpus.iob.tar.gz

Collection5 load_ne5 # ner news 1 000 2.96 Mb News articles with manual PER, LOC, ORG markup

wget http://www.labinform.ru/pub/named_entities/collection5.zip

unzip collection5.zip

rm collection5.zip

WiNER load_wikiner # ner 203 287 36.15 Mb Sentences from Wiki auto annotated with PER, LOC, ORG tags

wget https://github.com/dice-group/FOX/raw/master/input/Wikiner/aij-wikiner-ru-wp3.bz2

BSNLP-2019 load_bsnlp # ner 464 1.16 Mb Markup prepared for 2019 BSNLP Shared Task

wget http://bsnlp.cs.helsinki.fi/TRAININGDATA_BSNLP_2019_shared_task.zip

wget http://bsnlp.cs.helsinki.fi/TESTDATA_BSNLP_2019_shared_task.zip

unzip TRAININGDATA_BSNLP_2019_shared_task.zip

unzip TESTDATA_BSNLP_2019_shared_task.zip -d test_pl_cs_ru_bg

rm TRAININGDATA_BSNLP_2019_shared_task.zip TESTDATA_BSNLP_2019_shared_task.zip

Persons-1000 load_persons # ner news 1 000 2.96 Mb Same as Collection5, only PER markup + normalized names

wget http://ai-center.botik.ru/Airec/ai-resources/Persons-1000.zip

The Russian Drug Reaction Corpus (RuDReC) load_rudrec # ner 4 809 1.73 Kb RuDReC is a new partially annotated corpus of consumer reviews in Russian about pharmaceutical products for the detection of health-related named entities and the effectiveness of pharmaceutical products. Here you can download and work with the annotated part, to get the raw part (1.4M reviews) please refer to https://github.com/cimm-kzn/RuDReC.

wget https://github.com/cimm-kzn/RuDReC/raw/master/data/rudrec_annotated.json

Taiga Large collection of Russian texts from various sources: news sites, magazines, literature, social networks

wget https://linghub.ru/static/Taiga/retagged_taiga.tar.gz

tar -xzvf retagged_taiga.tar.gz

Arzamas load_taiga_arzamas # news 311 4.50 Mb
Fontanka load_taiga_fontanka # news 342 683 786.23 Mb
Interfax load_taiga_interfax # news 46 429 77.55 Mb
KP load_taiga_kp # news 45 503 61.79 Mb
Lenta load_taiga_lenta # news 36 446 95.15 Mb
Taiga/N+1 load_taiga_nplus1 # news 7 696 24.96 Mb
Magazines load_taiga_magazines # 39 890 2.19 Gb
Subtitles load_taiga_subtitles # 19 011 909.08 Mb
Social load_taiga_social # social 1 876 442 648.18 Mb
Proza load_taiga_proza # fiction 1 732 434 38.25 Gb
Stihi load_taiga_stihi # 9 157 686 12.80 Gb
Russian NLP Datasets Several Russian news datasets from webhose.io, lenta.ru and other news sites.
News load_buriy_news # news 2 154 801 6.84 Gb Dump of top 40 news + 20 fashion news sites.

wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2014.tar.bz2

wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2015-part1.tar.bz2

wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2015-part2.tar.bz2

Webhose load_buriy_webhose # news 285 965 859.32 Mb Dump from webhose.io, 300 sources for one month.

wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/webhose-2016.tar.bz2

ODS #proj_news_viz Several news sites scraped by members of #proj_news_viz ODS project.
Interfax load_ods_interfax # news 543 961 1.22 Gb wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/interfax.csv.gz
Gazeta load_ods_gazeta # news 865 847 1.63 Gb wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/gazeta.csv.gz
Izvestia load_ods_izvestia # news 86 601 307.19 Mb wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/iz.csv.gz
Meduza load_ods_meduza # news 71 806 270.11 Mb wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/meduza.csv.gz
RIA load_ods_ria # news 101 543 233.88 Mb wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/ria.csv.gz
Russia Today load_ods_rt # news 106 644 187.12 Mb wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/rt.csv.gz
TASS load_ods_tass # news 1 135 635 3.27 Gb wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/tass-001.csv.gz
Universal Dependencies
GSD load_ud_gsd # morph syntax 5 030 1.01 Mb wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-dev.conllu

wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-test.conllu

wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-train.conllu

Taiga load_ud_taiga # morph syntax 3 264 353.80 Kb wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-dev.conllu

wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-test.conllu

wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-train.conllu

PUD load_ud_pud # morph syntax 1 000 207.78 Kb wget https://github.com/UniversalDependencies/UD_Russian-PUD/raw/master/ru_pud-ud-test.conllu
SynTagRus load_ud_syntag # morph syntax 61 889 11.33 Mb wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-dev.conllu

wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-test.conllu

wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-train.conllu

morphoRuEval-2017
General Internet-Corpus load_morphoru_gicrya # morph 83 148 10.58 Mb wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/GIKRYA_texts_new.zip

unzip GIKRYA_texts_new.zip

rm GIKRYA_texts_new.zip

Russian National Corpus load_morphoru_rnc # morph 98 892 12.71 Mb wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/RNC_texts.rar

unrar x RNC_texts.rar

rm RNC_texts.rar

OpenCorpora load_morphoru_corpora # morph 38 510 4.80 Mb wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/OpenCorpora_Texts.rar

unrar x OpenCorpora_Texts.rar

rm OpenCorpora_Texts.rar

RUSSE Russian Semantic Relatedness
HJ: Human Judgements of Word Pairs load_russe_hj # emb sim wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/hj.csv
RT: Synonyms and Hypernyms from the Thesaurus RuThes load_russe_rt # emb sim wget https://raw.githubusercontent.com/nlpub/russe-evaluation/master/russe/evaluation/rt.csv
AE: Cognitive Associations from the Sociation.org Experiment load_russe_ae # emb sim wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/ae-train.csv

wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/ae-test.csv

wget https://raw.githubusercontent.com/nlpub/russe-evaluation/master/russe/evaluation/ae2.csv

Toloka Datasets
Lexical Relations from the Wisdom of the Crowd (LRWC) load_toloka_lrwc # emb sim wget https://tlk.s3.yandex.net/dataset/LRWC.zip

unzip LRWC.zip

rm LRWC.zip

The Russian Adverse Drug Reaction Corpus of Tweets (RuADReCT) load_ruadrect # social 9 515 2.09 Mb This corpus was developed for the Social Media Mining for Health Applications (#SMM4H) Shared Task 2020

wget https://github.com/cimm-kzn/RuDReC/raw/master/data/RuADReCT.zip

unzip RuADReCT.zip

rm RuADReCT.zip
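
The wget commands in this section can also be scripted from Python when wget is not available. The following is a minimal sketch using only the standard library; the download helper is hypothetical and not part of corus. It covers only the direct links, so entries marked "Manually download" and the unzip, tar and unrar steps still have to be done separately.

```python
import shutil
import urllib.request
from pathlib import Path


def download(url, path, chunk_size=1 << 20):
    """Stream `url` into `path` without holding the whole file in memory."""
    path = Path(path)
    # Skip the transfer if an earlier run already produced the file
    if path.exists():
        return path
    tmp = path.with_name(path.name + '.part')
    with urllib.request.urlopen(url) as response, tmp.open('wb') as file:
        shutil.copyfileobj(response, file, chunk_size)
    # Rename only after a complete transfer, so a partial file
    # is never mistaken for a finished archive
    tmp.replace(path)
    return path
```

For example, download('https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz', 'lenta-ru-news.csv.gz') would fetch the Lenta.ru v1.0 archive listed above.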

Support

  • Chat — https://telegram.me/naturallanguageprocessing
  • Issues — https://github.com/natasha/corus/issues
  • Commercial support — https://lab.alexkuk.ru

Development

Tests:

make test

Add new source:

1. Implement corus/sources/.py
2. Add import into corus/sources/__init__.py
3. Add meta into corus/sources/meta.py
4. Add example into docs.ipynb (check meta table is correct)
5. Run tests (readme is updated)
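
As a rough illustration of the first step, a new source module could pair a record type with a generator that reads the archive lazily. This is only a sketch under assumptions: the names ExampleRecord and load_example, the field list and the gzip-compressed CSV format are all invented here, and the existing corus sources may be organized differently.

```python
import csv
import gzip
from collections import namedtuple

# Hypothetical record type; the field names are invented for this example
ExampleRecord = namedtuple('ExampleRecord', ['url', 'title', 'text'])


def load_example(path):
    """Lazily yield ExampleRecord items from a gzip-compressed CSV file."""
    # Text mode with newline='' lets the csv module handle quoted line breaks
    with gzip.open(path, mode='rt', encoding='utf8', newline='') as file:
        reader = csv.reader(file)
        next(reader, None)  # skip the header row
        for row in reader:
            yield ExampleRecord(*row)
```

Yielding records instead of returning a list matches the way load_lenta is used in the Usage section, where records are pulled one at a time.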

Package:

make version
git push
git push --tags

make clean wheel upload
