MPNet

MPNet: Masked and Permuted Pre-training for Language Understanding (https://arxiv.org/pdf/2004.09297.pdf), by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu, is a novel pre-training method for language understanding tasks. It addresses the limitations of MLM (masked language modeling) in BERT and PLM (permuted language modeling) in XLNet and achieves better accuracy.

News: We have updated the pre-trained models.

Supported Features

  • A unified view and implementation of several pre-training models including BERT, XLNet, MPNet, etc.
  • Code for pre-training and fine-tuning on a variety of language understanding tasks (GLUE, SQuAD, RACE, etc.).

Installation

We implement MPNet and this pre-training toolkit on top of the fairseq codebase. Installation is as follows:

```
pip install --editable pretraining/
pip install pytorch_transformers==1.0.0 transformers scipy sklearn
```
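
As a quick sanity check after installation, you can confirm that fairseq and the tokenizer packages import cleanly. This is only an illustrative check, not part of the toolkit:

```python
# Illustrative post-install check (not part of the MPNet toolkit).
import fairseq
import transformers

print("fairseq:", getattr(fairseq, "__version__", "unknown"))
print("transformers:", getattr(transformers, "__version__", "unknown"))
```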

Pre-training MPNet

Our model is pre-trained with the BERT dictionary, so you first need to run `pip install transformers` to use the BERT tokenizer. We provide a script `encode.py` and a dictionary file `dict.txt` to tokenize your corpus. You can modify `encode.py` if you want to use another tokenizer (e.g., RoBERTa).
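
For intuition, the sketch below shows roughly the kind of transformation `encode.py` performs with a BERT tokenizer. It is a simplified illustration (single process, hard-coded paths, and `bert-base-uncased` are assumptions), not the actual script:

```python
# Simplified illustration of BERT-style tokenization into a .bpe file.
# The real MPNet/encode.py adds multiprocessing, --keep-empty handling, etc.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # assumed vocab

with open("corpus.raw", encoding="utf-8") as fin, \
     open("corpus.bpe", "w", encoding="utf-8") as fout:
    for line in fin:
        pieces = tokenizer.tokenize(line.strip())   # WordPiece tokens
        fout.write(" ".join(pieces) + "\n")
```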

1) Preprocess data

We use WikiText-103 as a demo. The running script is as follows:

```
wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
unzip wikitext-103-raw-v1.zip

for SPLIT in train valid test; do \
    python MPNet/encode.py \
        --inputs wikitext-103-raw/wiki.${SPLIT}.raw \
        --outputs wikitext-103-raw/wiki.${SPLIT}.bpe \
        --keep-empty \
        --workers 60; \
done
```

Then we need to binarize the data. The command is as follows:

```
fairseq-preprocess \
    --only-source \
    --srcdict MPNet/dict.txt \
    --trainpref wikitext-103-raw/wiki.train.bpe \
    --validpref wikitext-103-raw/wiki.valid.bpe \
    --testpref wikitext-103-raw/wiki.test.bpe \
    --destdir data-bin/wikitext-103 \
    --workers 60
```
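
If you want to verify that binarization succeeded, you can simply list the destination directory. Exact file names can vary across fairseq versions, so this is only a rough check:

```python
# Rough check that fairseq-preprocess wrote its binarized output.
import os

out_dir = "data-bin/wikitext-103"
print(sorted(os.listdir(out_dir)))  # expect a dict.txt plus binarized split files
```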

2) Pre-train MPNet

The following command trains an MPNet model:

```
TOTAL_UPDATES=125000    # Total number of training steps
WARMUP_UPDATES=10000    # Warmup the learning rate over this many updates
PEAK_LR=0.0005          # Peak learning rate, adjust as needed
TOKENS_PER_SAMPLE=512   # Max sequence length
MAX_POSITIONS=512       # Num. positional embeddings (usually same as above)
MAX_SENTENCES=16        # Number of sequences per batch (batch size)
UPDATE_FREQ=16          # Increase the batch size 16x

DATA_DIR=data-bin/wikitext-103

fairseq-train --fp16 $DATA_DIR \
    --task masked_permutation_lm --criterion masked_permutation_cross_entropy \
    --arch mpnet_base --sample-break-mode complete --tokens-per-sample $TOKENS_PER_SAMPLE \
    --optimizer adam --adam-betas '(0.9,0.98)' --adam-eps 1e-6 --clip-norm 0.0 \
    --lr-scheduler polynomial_decay --lr $PEAK_LR --warmup-updates $WARMUP_UPDATES --total-num-update $TOTAL_UPDATES \
    --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
    --max-sentences $MAX_SENTENCES --update-freq $UPDATE_FREQ \
    --max-update $TOTAL_UPDATES --log-format simple --log-interval 1 --input-mode 'mpnet'
```
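
As a back-of-the-envelope check of the batch settings above (assuming a single GPU, which is our assumption), the effective batch size per optimizer step works out as follows:

```python
# Effective batch size implied by MAX_SENTENCES and UPDATE_FREQ (single GPU assumed).
max_sentences = 16        # sequences per forward pass (MAX_SENTENCES)
update_freq = 16          # gradient accumulation steps (UPDATE_FREQ)
num_gpus = 1              # assumption; multiply by your actual GPU count
tokens_per_sample = 512   # TOKENS_PER_SAMPLE

sequences_per_update = max_sentences * update_freq * num_gpus
tokens_per_update = sequences_per_update * tokens_per_sample
print(sequences_per_update, "sequences /", tokens_per_update, "tokens per optimizer step")
```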

**Notes**: You can replace the arch with `mpnet_rel_base` and add `--mask-whole-words --bpe bert` to use relative position embeddings and whole-word masking.

**Notes**: You can set `--input-mode` to `mlm` or `plm` to train a masked language model or a permuted language model instead.
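
To make the difference between the two modes concrete, here is a toy illustration of the prediction targets each objective sets up. It is purely conceptual and does not reflect the toolkit's actual data pipeline:

```python
import random

tokens = ["the", "movie", "was", "really", "great"]

# MLM: mask some positions and predict them from the fully visible rest.
mask_positions = [1, 4]
mlm_input = ["[MASK]" if i in mask_positions else t for i, t in enumerate(tokens)]
print("MLM input:", mlm_input, "-> predict positions", mask_positions)

# PLM: pick a random factorization order and predict the last few positions
# autoregressively, each conditioned only on tokens earlier in that order.
order = list(range(len(tokens)))
random.shuffle(order)
predicted = order[-2:]   # the "predicted part" of the permutation
print("PLM order:", order, "-> predict positions", predicted, "in that order")
```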

Pre-trained models

We have updated the final pre-trained MPNet model for fine-tuning.

You can load the pre-trained MPNet model like this:

```python
import torch
from fairseq.models.masked_permutation_net import MPNet

mpnet = MPNet.from_pretrained('checkpoints', 'checkpoint_best.pt', 'path/to/data', bpe='bert')
assert isinstance(mpnet.model, torch.nn.Module)
```
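
Continuing from the snippet above, you can inspect the loaded checkpoint through the plain `torch.nn.Module` interface guaranteed by the assertion; the parameter count shown is just an illustrative check:

```python
# Inspect the loaded model via the generic torch.nn.Module interface.
model = mpnet.model
model.eval()                                          # switch to inference mode
n_params = sum(p.numel() for p in model.parameters())
print(f"MPNet checkpoint loaded with {n_params / 1e6:.1f}M parameters")
```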

Fine-tuning MPNet on downstream tasks

Acknowledgements

Our code is based on fairseq-0.8.0. We thank the fairseq team for their contributions to the open-source community.

Reference

If you find this toolkit useful in your work, please cite the corresponding paper listed below:

@article{song2020mpnet,
    title={MPNet: Masked and Permuted Pre-training for Language Understanding},
    author={Song, Kaitao and Tan, Xu and Qin, Tao and Lu, Jianfeng and Liu, Tie-Yan},
    journal={arXiv preprint arXiv:2004.09297},
    year={2020}
}

Related Works
