Neural Network Models for Joint POS Tagging and Dependency Parsing


Implementations of joint models for POS tagging and dependency parsing, as described in my papers:

  1. Dat Quoc Nguyen and Karin Verspoor. 2018. An improved neural network model for joint POS tagging and dependency parsing. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 81-91. [.bib] (jPTDP v2.0)
  2. Dat Quoc Nguyen, Mark Dras and Mark Johnson. 2017. A Novel Neural Network Model for Joint POS Tagging and Graph-based Dependency Parsing. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 134-142. [.bib] (jPTDP v1.0)

This GitHub project currently supports jPTDP v2.0; v1.0 can be found in the releases section. Please cite paper [1] when jPTDP is used to produce published results or is incorporated into other software. Bug reports, comments and suggestions about jPTDP are highly appreciated. As a free open-source implementation, jPTDP is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

Installation

jPTDP requires the following software packages:

  • Python 2.7
  • DyNet v2.0

    $ virtualenv -p python2.7 .DyNet
    $ source .DyNet/bin/activate
    $ pip install cython numpy
    $ pip install dynet==2.0.3

Once you have installed the prerequisite packages above, clone or download (and then unzip) jPTDP. The next sections show how to train a new joint model for POS tagging and dependency parsing, and how to utilize a pre-trained model.

NOTE: jPTDP has also been ported to Python 3.4+ by Santiago Castro. Note that the pre-trained models provided in the last section do not work with this ported version (see the discussion), so you may want to retrain jPTDP if you use it.

Train a joint model

Suppose that `SOURCE_DIR` denotes the source code directory. Like the files `train.conllu` and `dev.conllu` in folder `SOURCE_DIR/sample`, or the treebanks in the Universal Dependencies (UD) project, the training and development files must follow the 10-column CoNLL-U data format. For training, jPTDP will only use information from columns 1 (ID), 2 (FORM), 4 (coarse-grained POS tags, i.e. UPOSTAG), 7 (HEAD) and 8 (DEPREL).
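A token line in this format can be inspected with a few lines of Python. The helper below is a hypothetical sketch (not part of jPTDP) that extracts exactly the columns jPTDP reads:

```python
# Hypothetical helper (not part of jPTDP): extract the fields jPTDP uses
# from one 10-column CoNLL-U token line.
def read_token(line):
    """Return (ID, FORM, UPOSTAG, HEAD, DEPREL) for a token line."""
    cols = line.rstrip("\n").split("\t")
    assert len(cols) == 10, "CoNLL-U token lines have exactly 10 columns"
    return cols[0], cols[1], cols[3], int(cols[6]), cols[7]

# Example: "dogs" is a NOUN attached to token 3 with the nsubj relation.
line = "2\tdogs\tdog\tNOUN\tNNS\tNumber=Plur\t3\tnsubj\t_\t_"
print(read_token(line))  # ('2', 'dogs', 'NOUN', 3, 'nsubj')
```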

To train a joint model for POS tagging and dependency parsing, run:

SOURCE_DIR$ python jPTDP.py --dynet-seed 123456789 [--dynet-mem <int>] [--epochs <int>] [--lstmdims <int>] [--lstmlayers <int>] [--hidden <int>] [--wembedding <int>] [--cembedding <int>] [--pembedding <int>] [--prevectors <path>] [--model <name>] [--params <name>] --outdir <path> --train <path> --dev <path>

where hyper-parameters in [ ] are optional:

  • `--dynet-mem`: Specify the DyNet memory in MB.
  • `--epochs`: Specify the number of training epochs. Default: 30.
  • `--lstmdims`: Specify the number of BiLSTM dimensions. Default: 128.
  • `--lstmlayers`: Specify the number of BiLSTM layers. Default: 2.
  • `--hidden`: Specify the size of the MLP hidden layer. Default: 100.
  • `--wembedding`: Specify the size of word embeddings. Default: 100.
  • `--cembedding`: Specify the size of character embeddings. Default: 50.
  • `--pembedding`: Specify the size of POS tag embeddings. Default: 100.
  • `--prevectors`: Specify the path to a pre-trained word embedding file for initialization. Default: "None" (i.e. word embeddings are randomly initialized).
  • `--model`: Specify a name for the model parameters file. Default: "model".
  • `--params`: Specify a name for the model hyper-parameters file. Default: "model.params".
  • `--outdir`: Specify the path to the directory where the trained model will be saved.
  • `--train`: Specify the path to the training data file.
  • `--dev`: Specify the path to the development data file.

For example:

SOURCE_DIR$ python jPTDP.py --dynet-seed 123456789 --dynet-mem 1000 --epochs 30 --lstmdims 128 --lstmlayers 2 --hidden 100 --wembedding 100 --cembedding 50 --pembedding 100 --model trialmodel --params trialmodel.params --outdir sample/ --train sample/train.conllu --dev sample/dev.conllu

will produce model files `trialmodel` and `trialmodel.params` in folder `SOURCE_DIR/sample`.

If you would like to use the fine-grained language-specific POS tags in the 5th column instead of the coarse-grained POS tags in the 4th column, use `swapper.py` in folder `SOURCE_DIR/utils` to swap the contents of the 4th and 5th columns:

SOURCE_DIR$ python utils/swapper.py <file-path>

For example:

SOURCE_DIR$ python utils/swapper.py sample/train.conllu
SOURCE_DIR$ python utils/swapper.py sample/dev.conllu

will generate two new files for training: `train.conllu.ux2xu` and `dev.conllu.ux2xu` in folder `SOURCE_DIR/sample`.

Utilize a pre-trained model

Assume that you are going to utilize a pre-trained model to annotate a corpus in which each line represents a tokenized/word-segmented sentence. Use `converter.py` in folder `SOURCE_DIR/utils` to obtain a 10-column data format of this corpus:

SOURCE_DIR$ python utils/converter.py <file-path>

For example:

SOURCE_DIR$ python utils/converter.py sample/test

will generate a file named `test.conllu` in folder `SOURCE_DIR/sample`, which can later be used as input to the pre-trained model.
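The conversion itself is mechanical. The sketch below is a hypothetical illustration of the resulting format, not the actual `utils/converter.py`:

```python
# Hypothetical sketch (not the actual utils/converter.py): turn one
# pre-tokenized sentence into 10-column lines, using '_' for the
# fields that the pre-trained model will fill in or ignore.
def to_conllu(sentence):
    rows = []
    for i, form in enumerate(sentence.split(), start=1):
        rows.append("\t".join([str(i), form] + ["_"] * 8))
    return "\n".join(rows) + "\n"   # sentences are separated by blank lines

print(to_conllu("The dog barks"))
```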

To utilize a pre-trained model for POS tagging and dependency parsing, run:

SOURCE_DIR$ python jPTDP.py --predict --model <path> --params <path> --test <path> --outdir <path> --output <name>

  • `--model`: Specify the path to the model parameters file.
  • `--params`: Specify the path to the model hyper-parameters file.
  • `--test`: Specify the path to the 10-column input file.
  • `--outdir`: Specify the path to the directory where the output file will be saved.
  • `--output`: Specify the name of the output file.

For example:

SOURCE_DIR$ python jPTDP.py --predict --model sample/trialmodel --params sample/trialmodel.params --test sample/test.conllu --outdir sample/ --output test.conllu.pred
SOURCE_DIR$ python jPTDP.py --predict --model sample/trialmodel --params sample/trialmodel.params --test sample/dev.conllu --outdir sample/ --output dev.conllu.pred

will produce output files `test.conllu.pred` and `dev.conllu.pred` in folder `SOURCE_DIR/sample`.

Pre-trained models

Pre-trained jPTDP v2.0 models, which were trained on the English WSJ Penn treebank, GENIA and UD v2.2 treebanks, can be found HERE. Results on test sets (as detailed in paper [1]) are as follows:

| Treebank | Model name | POS | UAS | LAS |
| --- | --- | --- | --- | --- |
| English WSJ Penn treebank | model256 | 97.97 | 94.51 | 92.87 |
| English WSJ Penn treebank | model | 97.88 | 94.25 | 92.58 |

`model256` and `model` denote the pre-trained models which use 256- and 128-dimensional LSTM hidden states, respectively, i.e. `model256` is more accurate but slower.

| Treebank | Code | UPOS | UAS | LAS |
| --- | --- | --- | --- | --- |
| UD_Afrikaans-AfriBooms | af_afribooms | 95.73 | 82.57 | 78.89 |
| UD_Ancient_Greek-PROIEL | grc_proiel | 96.05 | 77.57 | 72.84 |
| UD_Ancient_Greek-Perseus | grc_perseus | 88.95 | 65.09 | 58.35 |
| UD_Arabic-PADT | ar_padt | 96.33 | 86.08 | 80.97 |
| UD_Basque-BDT | eu_bdt | 93.62 | 79.86 | 75.07 |
| UD_Bulgarian-BTB | bg_btb | 98.07 | 91.47 | 87.69 |
| UD_Catalan-AnCora | ca_ancora | 98.46 | 90.78 | 88.40 |
| UD_Chinese-GSD | zh_gsd | 93.26 | 82.50 | 77.51 |
| UD_Croatian-SET | hr_set | 97.42 | 88.74 | 83.62 |
| UD_Czech-CAC | cs_cac | 98.87 | 89.85 | 87.13 |
| UD_Czech-FicTree | cs_fictree | 97.98 | 88.94 | 85.64 |
| UD_Czech-PDT | cs_pdt | 98.74 | 89.64 | 87.04 |
| UD_Czech-PUD | cs_pud | 96.71 | 87.62 | 82.28 |
| UD_Danish-DDT | da_ddt | 96.18 | 82.17 | 78.88 |
| UD_Dutch-Alpino | nl_alpino | 95.62 | 86.34 | 82.37 |
| UD_Dutch-LassySmall | nl_lassysmall | 95.21 | 86.46 | 82.14 |
| UD_English-EWT | en_ewt | 95.48 | 87.55 | 84.71 |
| UD_English-GUM | en_gum | 94.10 | 84.88 | 80.45 |
| UD_English-LinES | en_lines | 95.55 | 80.34 | 75.40 |
| UD_English-PUD | en_pud | 95.25 | 87.49 | 84.25 |
| UD_Estonian-EDT | et_edt | 96.87 | 85.45 | 82.13 |
| UD_Finnish-FTB | fi_ftb | 94.53 | 86.10 | 82.45 |
| UD_Finnish-PUD | fi_pud | 96.44 | 87.54 | 84.60 |
| UD_Finnish-TDT | fi_tdt | 96.12 | 86.07 | 82.92 |
| UD_French-GSD | fr_gsd | 97.11 | 89.45 | 86.43 |
| UD_French-Sequoia | fr_sequoia | 97.92 | 89.71 | 87.43 |
| UD_French-Spoken | fr_spoken | 94.25 | 79.80 | 73.45 |
| UD_Galician-CTG | gl_ctg | 97.12 | 85.09 | 81.93 |
| UD_Galician-TreeGal | gl_treegal | 93.66 | 77.71 | 71.63 |
| UD_German-GSD | de_gsd | 94.07 | 81.45 | 76.68 |
| UD_Gothic-PROIEL | got_proiel | 93.45 | 79.80 | 71.85 |
| UD_Greek-GDT | el_gdt | 96.59 | 87.52 | 84.64 |
| UD_Hebrew-HTB | he_htb | 96.24 | 87.65 | 82.64 |
| UD_Hindi-HDTB | hi_hdtb | 96.94 | 93.25 | 89.83 |
| UD_Hungarian-Szeged | hu_szeged | 92.07 | 76.18 | 69.75 |
| UD_Indonesian-GSD | id_gsd | 93.29 | 84.64 | 77.71 |
| UD_Irish-IDT | ga_idt | 89.74 | 75.72 | 65.78 |
| UD_Italian-ISDT | it_isdt | 98.01 | 92.33 | 90.20 |
| UD_Italian-PoSTWITA | it_postwita | 95.41 | 84.20 | 79.11 |
| UD_Japanese-GSD | ja_gsd | 97.27 | 94.21 | 92.02 |
| UD_Japanese-Modern | ja_modern | 70.53 | 66.88 | 49.51 |
| UD_Korean-GSD | ko_gsd | 93.35 | 81.32 | 76.58 |
| UD_Korean-Kaist | ko_kaist | 93.53 | 83.59 | 80.74 |
| UD_Latin-ITTB | la_ittb | 98.12 | 82.99 | 79.96 |
| UD_Latin-PROIEL | la_proiel | 95.54 | 74.95 | 69.76 |
| UD_Latin-Perseus | la_perseus | 82.36 | 57.21 | 46.28 |
| UD_Latvian-LVTB | lv_lvtb | 93.53 | 81.06 | 76.13 |
| UD_North_Sami-Giella | sme_giella | 87.48 | 65.79 | 58.09 |
| UD_Norwegian-Bokmaal | no_bokmaal | 97.73 | 89.83 | 87.57 |
| UD_Norwegian-Nynorsk | no_nynorsk | 97.33 | 89.73 | 87.29 |
| UD_Norwegian-NynorskLIA | no_nynorsklia | 85.22 | 64.14 | 54.31 |
| UD_Old_Church_Slavonic-PROIEL | cu_proiel | 93.69 | 80.59 | 73.93 |
| UD_Old_French-SRCMF | fro_srcmf | 95.12 | 86.65 | 81.15 |
| UD_Persian-Seraji | fa_seraji | 96.66 | 88.07 | 84.07 |
| UD_Polish-LFG | pl_lfg | 98.22 | 95.29 | 93.10 |
| UD_Polish-SZ | pl_sz | 97.05 | 90.98 | 87.66 |
| UD_Portuguese-Bosque | pt_bosque | 96.76 | 88.67 | 85.71 |
| UD_Romanian-RRT | ro_rrt | 97.43 | 88.74 | 83.54 |
| UD_Russian-SynTagRus | ru_syntagrus | 98.51 | 91.00 | 88.91 |
| UD_Russian-Taiga | ru_taiga | 85.49 | 65.52 | 56.33 |
| UD_Serbian-SET | sr_set | 97.40 | 89.32 | 85.03 |
| UD_Slovak-SNK | sk_snk | 95.18 | 85.88 | 81.89 |
| UD_Slovenian-SSJ | sl_ssj | 97.79 | 88.26 | 86.10 |
| UD_Slovenian-SST | sl_sst | 89.50 | 66.14 | 58.13 |
| UD_Spanish-AnCora | es_ancora | 98.57 | 90.30 | 87.98 |
| UD_Swedish-LinES | sv_lines | 95.51 | 83.60 | 78.97 |
| UD_Swedish-PUD | sv_pud | 92.10 | 79.53 | 74.53 |
| UD_Swedish-Talbanken | sv_talbanken | 96.55 | 86.53 | 83.01 |
| UD_Turkish-IMST | tr_imst | 92.93 | 70.53 | 62.55 |
| UD_Ukrainian-IU | uk_iu | 95.24 | 83.47 | 79.38 |
| UD_Urdu-UDTB | ur_udtb | 93.35 | 86.74 | 80.44 |
| UD_Uyghur-UDT | ug_udt | 87.63 | 76.14 | 63.37 |
| UD_Vietnamese-VTB | vi_vtb | 87.63 | 67.72 | 58.27 |
