Neural network models for joint POS tagging and dependency parsing (CoNLL 2017-2018)
Implementations of joint models for POS tagging and dependency parsing, as described in my papers:
This GitHub project currently supports jPTDP v2.0, while v1.0 can be found in the release section. Please cite paper [1] when jPTDP is used to produce published results or is incorporated into other software. Bug reports, comments and suggestions about jPTDP are highly appreciated. As a free open-source implementation, jPTDP is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
jPTDP requires the following software packages:
Python 2.7
DyNet v2.0

$ virtualenv -p python2.7 .DyNet
$ source .DyNet/bin/activate
$ pip install cython numpy
$ pip install dynet==2.0.3
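A quick sanity check that the DyNet installation succeeded (assuming the virtualenv above is still activated) is simply to import the package:

$ python -c "import dynet"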
Once you have installed the prerequisite packages above, you can clone or download (and then unzip) jPTDP. The next sections show how to train a new joint model for POS tagging and dependency parsing, and how to use a pre-trained model.
NOTE: jPTDP has also been ported to run with Python 3.4+ by Santiago Castro. Note that the pre-trained models provided in the last section will not work with this ported version (see a discussion), so you may want to retrain jPTDP if you use it.
Suppose that `SOURCE_DIR` denotes the source code directory. Like the files `train.conllu` and `dev.conllu` in folder `SOURCE_DIR/sample` and the treebanks in the Universal Dependencies (UD) project, the training and development files must follow the 10-column data format. For training, jPTDP only uses the information in columns 1 (ID), 2 (FORM), 4 (coarse-grained POS tags, UPOSTAG), 7 (HEAD) and 8 (DEPREL).
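For illustration, here is a minimal made-up sentence in that 10-column format (columns are tab-separated, `_` marks an unused field; this example is not taken from the sample files). Only columns 1, 2, 4, 7 and 8 are consumed during training:

```
1	The	_	DET	_	_	2	det	_	_
2	dog	_	NOUN	_	_	3	nsubj	_	_
3	barks	_	VERB	_	_	0	root	_	_
```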
To train a joint model for POS tagging and dependency parsing, run:
SOURCE_DIR$ python jPTDP.py --dynet-seed 123456789 [--dynet-mem <int>] [--epochs <int>] [--lstmdims <int>] [--lstmlayers <int>] [--hidden <int>] [--wembedding <int>] [--cembedding <int>] [--pembedding <int>] [--prevectors <path-to-pre-trained-word-embedding-file>] [--model <model-file-name>] [--params <hyper-parameter-file-name>] --outdir <path-to-output-directory> --train <path-to-training-file> --dev <path-to-development-file>
where hyper-parameters in [] are optional:
--dynet-mem: Specify DyNet memory in MB.
--epochs: Specify number of training epochs. Default value is 30.
--lstmdims: Specify number of BiLSTM dimensions. Default value is 128.
--lstmlayers: Specify number of BiLSTM layers. Default value is 2.
--hidden: Specify size of MLP hidden layer. Default value is 100.
--wembedding: Specify size of word embeddings. Default value is 100.
--cembedding: Specify size of character embeddings. Default value is 50.
--pembedding: Specify size of POS tag embeddings. Default value is 100.
--prevectors: Specify path to the pre-trained word embedding file for initialization. Default value is "None" (i.e. word embeddings are randomly initialized).
--model: Specify a name for model parameters file. Default value is "model".
--params: Specify a name for model hyper-parameters file. Default value is "model.params".
--outdir: Specify path to directory where the trained model will be saved.
--train: Specify path to the training data file.
--dev: Specify path to the development data file.
For example:
SOURCE_DIR$ python jPTDP.py --dynet-seed 123456789 --dynet-mem 1000 --epochs 30 --lstmdims 128 --lstmlayers 2 --hidden 100 --wembedding 100 --cembedding 50 --pembedding 100 --model trialmodel --params trialmodel.params --outdir sample/ --train sample/train.conllu --dev sample/dev.conllu
will produce model files `trialmodel` and `trialmodel.params` in folder `SOURCE_DIR/sample`.
If you would like to use the fine-grained language-specific POS tags in the 5th column instead of the coarse-grained POS tags in the 4th column, use `swapper.py` in folder `SOURCE_DIR/utils` to swap the contents of the 4th and 5th columns:

SOURCE_DIR$ python utils/swapper.py <path-to-input-file>
For example:
SOURCE_DIR$ python utils/swapper.py sample/train.conllu
SOURCE_DIR$ python utils/swapper.py sample/dev.conllu
will generate two new files for training, `train.conllu.ux2xu` and `dev.conllu.ux2xu`, in folder `SOURCE_DIR/sample`.
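The swap itself is straightforward. The following standalone sketch shows the idea for one input file; it is not the actual `swapper.py` code, and the `.ux2xu` suffix and paths are only reproduced here to mirror the example above:

```python
def swap_upos_xpos(line):
    """Swap columns 4 (UPOSTAG) and 5 (XPOSTAG) of one 10-column line."""
    if not line.strip() or line.startswith("#"):
        return line  # keep blank lines and comment lines unchanged
    cols = line.rstrip("\n").split("\t")
    cols[3], cols[4] = cols[4], cols[3]
    return "\t".join(cols) + "\n"

# Example: write sample/train.conllu.ux2xu with the two columns swapped.
with open("sample/train.conllu") as fin, open("sample/train.conllu.ux2xu", "w") as fout:
    for line in fin:
        fout.write(swap_upos_xpos(line))
```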
Assume that you are going to utilize a pre-trained model to annotate a corpus in which each line represents a tokenized/word-segmented sentence. Use `converter.py` in folder `SOURCE_DIR/utils` to obtain a 10-column data format version of this corpus:

SOURCE_DIR$ python utils/converter.py <path-to-corpus-file>
For example:
SOURCE_DIR$ python utils/converter.py sample/test
will generate a file named `test.conllu` in folder `SOURCE_DIR/sample`, which can later be used as input to the pre-trained model.
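Conceptually, the conversion wraps each token of a sentence in a dummy 10-column line. The sketch below illustrates this under the assumption that the unknown fields are filled with placeholder values; it is not the actual `converter.py` code:

```python
def to_conllu(sentence):
    """Turn one whitespace-tokenized sentence into 10-column lines."""
    lines = []
    for i, token in enumerate(sentence.split(), start=1):
        # ID and FORM are real; the remaining columns get dummy values.
        cols = [str(i), token, "_", "_", "_", "_", "0", "_", "_", "_"]
        lines.append("\t".join(cols))
    return "\n".join(lines) + "\n\n"  # a blank line separates sentences

# Example: convert a whitespace-tokenized corpus into sample/test.conllu.
with open("sample/test") as fin, open("sample/test.conllu", "w") as fout:
    for sentence in fin:
        if sentence.strip():
            fout.write(to_conllu(sentence))
```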
To utilize a pre-trained model for POS tagging and dependency parsing, run:

SOURCE_DIR$ python jPTDP.py --predict --model <path-to-model-parameters-file> --params <path-to-model-hyper-parameters-file> --test <path-to-10-column-input-file> --outdir <path-to-output-directory> --output <output-file-name>
--model: Specify path to model parameters file.
--params: Specify path to model hyper-parameters file.
--test: Specify path to 10-column input file.
--outdir: Specify path to directory where output file will be saved.
--output: Specify name of the output file.
For example:
SOURCE_DIR$ python jPTDP.py --predict --model sample/trialmodel --params sample/trialmodel.params --test sample/test.conllu --outdir sample/ --output test.conllu.pred
SOURCE_DIR$ python jPTDP.py --predict --model sample/trialmodel --params sample/trialmodel.params --test sample/dev.conllu --outdir sample/ --output dev.conllu.pred
will produce output files `test.conllu.pred` and `dev.conllu.pred` in folder `SOURCE_DIR/sample`.
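The output files are again in the 10-column format, with predicted POS tags in column 4 and predicted HEAD/DEPREL in columns 7 and 8. Here is a small sketch for pulling those predictions out of an output file such as `sample/test.conllu.pred`; the helper function itself is hypothetical and only assumes the column layout described above:

```python
def read_predictions(path):
    """Yield one sentence at a time as a list of (form, upos, head, deprel) tuples."""
    sentence = []
    with open(path) as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:  # blank line ends the current sentence
                if sentence:
                    yield sentence
                    sentence = []
                continue
            cols = line.split("\t")
            sentence.append((cols[1], cols[3], cols[6], cols[7]))
    if sentence:
        yield sentence

for sent in read_predictions("sample/test.conllu.pred"):
    print(sent)
```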
Pre-trained jPTDP v2.0 models, which were trained on the English WSJ Penn treebank, GENIA and the UD v2.2 treebanks, can be found HERE. Results on the test sets (as detailed in paper [1]) are as follows:
Treebank | Model name | POS | UAS | LAS |
---|---|---|---|---|
English WSJ Penn treebank | model256 | 97.97 | 94.51 | 92.87 |
English WSJ Penn treebank | model | 97.88 | 94.25 | 92.58 |
`model256` and `model` denote the pre-trained models which use 256- and 128-dimensional LSTM hidden states, respectively, i.e. `model256` is more accurate but slower.
Treebank | Code | UPOS | UAS | LAS |
---|---|---|---|---|
UD_Afrikaans-AfriBooms | af_afribooms | 95.73 | 82.57 | 78.89 |
UD_Ancient_Greek-PROIEL | grc_proiel | 96.05 | 77.57 | 72.84 |
UD_Ancient_Greek-Perseus | grc_perseus | 88.95 | 65.09 | 58.35 |
UD_Arabic-PADT | ar_padt | 96.33 | 86.08 | 80.97 |
UD_Basque-BDT | eu_bdt | 93.62 | 79.86 | 75.07 |
UD_Bulgarian-BTB | bg_btb | 98.07 | 91.47 | 87.69 |
UD_Catalan-AnCora | ca_ancora | 98.46 | 90.78 | 88.40 |
UD_Chinese-GSD | zh_gsd | 93.26 | 82.50 | 77.51 |
UD_Croatian-SET | hr_set | 97.42 | 88.74 | 83.62 |
UD_Czech-CAC | cs_cac | 98.87 | 89.85 | 87.13 |
UD_Czech-FicTree | cs_fictree | 97.98 | 88.94 | 85.64 |
UD_Czech-PDT | cs_pdt | 98.74 | 89.64 | 87.04 |
UD_Czech-PUD | cs_pud | 96.71 | 87.62 | 82.28 |
UD_Danish-DDT | da_ddt | 96.18 | 82.17 | 78.88 |
UD_Dutch-Alpino | nl_alpino | 95.62 | 86.34 | 82.37 |
UD_Dutch-LassySmall | nl_lassysmall | 95.21 | 86.46 | 82.14 |
UD_English-EWT | en_ewt | 95.48 | 87.55 | 84.71 |
UD_English-GUM | en_gum | 94.10 | 84.88 | 80.45 |
UD_English-LinES | en_lines | 95.55 | 80.34 | 75.40 |
UD_English-PUD | en_pud | 95.25 | 87.49 | 84.25 |
UD_Estonian-EDT | et_edt | 96.87 | 85.45 | 82.13 |
UD_Finnish-FTB | fi_ftb | 94.53 | 86.10 | 82.45 |
UD_Finnish-PUD | fi_pud | 96.44 | 87.54 | 84.60 |
UD_Finnish-TDT | fi_tdt | 96.12 | 86.07 | 82.92 |
UD_French-GSD | fr_gsd | 97.11 | 89.45 | 86.43 |
UD_French-Sequoia | fr_sequoia | 97.92 | 89.71 | 87.43 |
UD_French-Spoken | fr_spoken | 94.25 | 79.80 | 73.45 |
UD_Galician-CTG | gl_ctg | 97.12 | 85.09 | 81.93 |
UD_Galician-TreeGal | gl_treegal | 93.66 | 77.71 | 71.63 |
UD_German-GSD | de_gsd | 94.07 | 81.45 | 76.68 |
UD_Gothic-PROIEL | got_proiel | 93.45 | 79.80 | 71.85 |
UD_Greek-GDT | el_gdt | 96.59 | 87.52 | 84.64 |
UD_Hebrew-HTB | he_htb | 96.24 | 87.65 | 82.64 |
UD_Hindi-HDTB | hi_hdtb | 96.94 | 93.25 | 89.83 |
UD_Hungarian-Szeged | hu_szeged | 92.07 | 76.18 | 69.75 |
UD_Indonesian-GSD | id_gsd | 93.29 | 84.64 | 77.71 |
UD_Irish-IDT | ga_idt | 89.74 | 75.72 | 65.78 |
UD_Italian-ISDT | it_isdt | 98.01 | 92.33 | 90.20 |
UD_Italian-PoSTWITA | it_postwita | 95.41 | 84.20 | 79.11 |
UD_Japanese-GSD | ja_gsd | 97.27 | 94.21 | 92.02 |
UD_Japanese-Modern | ja_modern | 70.53 | 66.88 | 49.51 |
UD_Korean-GSD | ko_gsd | 93.35 | 81.32 | 76.58 |
UD_Korean-Kaist | ko_kaist | 93.53 | 83.59 | 80.74 |
UD_Latin-ITTB | la_ittb | 98.12 | 82.99 | 79.96 |
UD_Latin-PROIEL | la_proiel | 95.54 | 74.95 | 69.76 |
UD_Latin-Perseus | la_perseus | 82.36 | 57.21 | 46.28 |
UD_Latvian-LVTB | lv_lvtb | 93.53 | 81.06 | 76.13 |
UD_North_Sami-Giella | sme_giella | 87.48 | 65.79 | 58.09 |
UD_Norwegian-Bokmaal | no_bokmaal | 97.73 | 89.83 | 87.57 |
UD_Norwegian-Nynorsk | no_nynorsk | 97.33 | 89.73 | 87.29 |
UD_Norwegian-NynorskLIA | no_nynorsklia | 85.22 | 64.14 | 54.31 |
UD_Old_Church_Slavonic-PROIEL | cu_proiel | 93.69 | 80.59 | 73.93 |
UD_Old_French-SRCMF | fro_srcmf | 95.12 | 86.65 | 81.15 |
UD_Persian-Seraji | fa_seraji | 96.66 | 88.07 | 84.07 |
UD_Polish-LFG | pl_lfg | 98.22 | 95.29 | 93.10 |
UD_Polish-SZ | pl_sz | 97.05 | 90.98 | 87.66 |
UD_Portuguese-Bosque | pt_bosque | 96.76 | 88.67 | 85.71 |
UD_Romanian-RRT | ro_rrt | 97.43 | 88.74 | 83.54 |
UD_Russian-SynTagRus | ru_syntagrus | 98.51 | 91.00 | 88.91 |
UD_Russian-Taiga | ru_taiga | 85.49 | 65.52 | 56.33 |
UD_Serbian-SET | sr_set | 97.40 | 89.32 | 85.03 |
UD_Slovak-SNK | sk_snk | 95.18 | 85.88 | 81.89 |
UD_Slovenian-SSJ | sl_ssj | 97.79 | 88.26 | 86.10 |
UD_Slovenian-SST | sl_sst | 89.50 | 66.14 | 58.13 |
UD_Spanish-AnCora | es_ancora | 98.57 | 90.30 | 87.98 |
UD_Swedish-LinES | sv_lines | 95.51 | 83.60 | 78.97 |
UD_Swedish-PUD | sv_pud | 92.10 | 79.53 | 74.53 |
UD_Swedish-Talbanken | sv_talbanken | 96.55 | 86.53 | 83.01 |
UD_Turkish-IMST | tr_imst | 92.93 | 70.53 | 62.55 |
UD_Ukrainian-IU | uk_iu | 95.24 | 83.47 | 79.38 |
UD_Urdu-UDTB | ur_udtb | 93.35 | 86.74 | 80.44 |
UD_Uyghur-UDT | ug_udt | 87.63 | 76.14 | 63.37 |
UD_Vietnamese-VTB | vi_vtb | 87.63 | 67.72 | 58.27 |