Description

Open Source Pre-training Model Framework in PyTorch & Pre-trained Model Zoo


Pre-training has become an essential part of NLP tasks and has led to remarkable improvements. UER-py (Universal Encoder Representations) is a toolkit for pre-training on general-domain corpora and fine-tuning on downstream tasks. UER-py maintains model modularity and supports research extensibility. It facilitates the use of pre-training models and provides interfaces for users to extend further. With UER-py, we build a model zoo which contains pre-trained models based on different corpora, encoders, and targets.



Features

UER-py has the following features:

  • Reproducibility. UER-py has been tested on many datasets and should match the performance of the original pre-training model implementations such as BERT, GPT, ELMo, and T5.
  • Multi-GPU. UER-py supports CPU mode, single GPU mode, and distributed training mode.
  • Model modularity. UER-py is divided into multiple components: embedding, encoder, target, and downstream task fine-tuning. Ample modules are implemented in each component. Clear and robust interfaces allow users to combine modules with as few restrictions as possible.
  • Efficiency. UER-py refines its pre-processing, pre-training, and fine-tuning stages, which largely improves speed and reduces memory usage.
  • Model zoo. With the help of UER-py, we pre-trained models with different corpora, encoders, and targets. Proper selection of pre-trained models is important to downstream task performance.
  • SOTA results. UER-py supports comprehensive downstream tasks (e.g. classification and machine reading comprehension) and has been used in winning solutions of many NLP competitions.
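
To make the modularity idea concrete, here is a toy sketch of how an embedding, an encoder, and a target could be composed behind one interface. The names `Model`, `toy_embedding`, `toy_encoder`, and `toy_target` are hypothetical and are not UER-py's actual classes; the sketch only illustrates that each component can be swapped without touching the others.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]


@dataclass
class Model:
    """Toy composition of three swappable parts (not UER-py's real classes)."""
    embedding: Callable[[List[int]], List[Vector]]
    encoder: Callable[[List[Vector]], List[Vector]]
    target: Callable[[List[Vector]], float]

    def forward(self, token_ids: List[int]) -> float:
        # Each stage depends only on the output type of the previous one.
        return self.target(self.encoder(self.embedding(token_ids)))


def toy_embedding(token_ids: List[int]) -> List[Vector]:
    # Map each token id to a 2-dimensional vector.
    return [[float(t), float(t) * 0.5] for t in token_ids]


def toy_encoder(vectors: List[Vector]) -> List[Vector]:
    # Stand-in for an encoder: add the sequence mean to every position.
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return [[v[i] + mean[i] for i in range(dim)] for v in vectors]


def toy_target(vectors: List[Vector]) -> float:
    # Stand-in for a training target: reduce everything to one scalar.
    return sum(sum(v) for v in vectors)


model = Model(toy_embedding, toy_encoder, toy_target)
# Swapping the encoder for an identity function leaves the other parts unchanged.
identity_model = Model(toy_embedding, lambda vs: vs, toy_target)
```

In UER-py itself the corresponding choice is made on the command line through --embedding, --encoder, and --target.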


Requirements

  • Python 3.6
  • torch >= 1.1
  • six >= 1.12.0
  • argparse
  • packaging
  • For the mixed precision training you will need apex from NVIDIA
  • For the pre-trained model conversion (related with TensorFlow) you will need TensorFlow
  • For the tokenization with sentencepiece model you will need SentencePiece
  • For developing a stacking model you will need LightGBM and BayesianOptimization


Quickstart

This section uses several commonly used examples to demonstrate how to use UER-py. More details are discussed in Instructions. We first use the BERT model on the Douban book review classification dataset. We pre-train the model on the book review corpus and then fine-tune it on the classification dataset. There are three input files: the book review corpus, the book review classification dataset, and the vocabulary. All files are encoded in UTF-8 and are included in this project.

The format of the corpus for BERT is as follows:
```
doc1-sent1
doc1-sent2
doc1-sent3

doc2-sent1

doc3-sent1
doc3-sent2
```
The book review corpus is derived from the book review classification dataset. We remove the labels and split each review into two parts at the middle (see book_review_bert.txt in the corpora folder).
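
The conversion described above could look roughly like the following sketch. It assumes the input rows are `label<TAB>text` with a header row, and it splits at the character midpoint; the exact splitting rule used to produce the shipped corpus is not documented here, so treat `review_to_document` and `build_bert_corpus` as hypothetical helpers.

```python
from typing import Iterable, List


def review_to_document(text: str) -> List[str]:
    """Split one review into two 'sentences' at its character midpoint."""
    text = text.strip()
    middle = len(text) // 2
    parts = [text[:middle], text[middle:]]
    return [p for p in parts if p]  # drop an empty half for very short reviews


def build_bert_corpus(tsv_lines: Iterable[str]) -> str:
    """Turn 'label<TAB>text' rows (header first) into a BERT-format corpus."""
    documents = []
    for i, line in enumerate(tsv_lines):
        if i == 0:
            continue  # skip the header row
        line = line.rstrip("\n")
        if not line:
            continue
        _label, text = line.split("\t", 1)  # the label is discarded
        sentences = review_to_document(text)
        if sentences:
            documents.append("\n".join(sentences))
    # One sentence per line, documents separated by a blank line.
    return "\n\n".join(documents) + "\n"


rows = ["label\ttext_a", "1\tabcdef", "0\tuvwxyz12"]
corpus = build_bert_corpus(rows)
```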

The format of the classification dataset is as follows:

label    text_a
1        instance1
0        instance2
1        instance3
Label and instance are separated by \t. The first row is a list of column names. For n-way classification, the label ID should be an integer from 0 to n-1 (inclusive).
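
Before fine-tuning on your own data, it can help to check a file against these rules. The sketch below is a stand-alone validator, not part of UER-py; the function name `read_classification_tsv` is hypothetical, and it assumes the columns are named `label` and `text_a` as in the example above.

```python
from typing import Iterable, List, Tuple


def read_classification_tsv(lines: Iterable[str], labels_num: int) -> List[Tuple[int, str]]:
    """Parse tab-separated rows and check that labels lie in [0, labels_num)."""
    rows = [line.rstrip("\n") for line in lines]
    header = rows[0].split("\t")
    if "label" not in header or "text_a" not in header:
        raise ValueError(f"unexpected header: {header}")
    label_col, text_col = header.index("label"), header.index("text_a")

    examples = []
    for line_no, row in enumerate(rows[1:], start=2):
        if not row:
            continue  # tolerate blank lines
        fields = row.split("\t")
        if len(fields) != len(header):
            raise ValueError(f"line {line_no}: expected {len(header)} columns")
        label = int(fields[label_col])
        if not 0 <= label < labels_num:
            raise ValueError(f"line {line_no}: label {label} out of range")
        examples.append((label, fields[text_col]))
    return examples


examples = read_classification_tsv(
    ["label\ttext_a\n", "1\tinstance1\n", "0\tinstance2\n"], labels_num=2
)
```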

We use Google's Chinese vocabulary file models/google_zh_vocab.txt, which contains 21128 Chinese characters.

We first pre-process the book review corpus. We need to specify the model's target in the pre-processing stage (--target):

python3 preprocess.py --corpus_path corpora/book_review_bert.txt --vocab_path models/google_zh_vocab.txt --dataset_path dataset.pt \
                      --processes_num 8 --target bert
Notice that six>=1.12.0 is required.

Pre-processing is time-consuming. Using multiple processes (--processes_num) can greatly speed it up. After pre-processing, the raw text is converted to dataset.pt, which is the input of pretrain.py. Then we download Google's original pre-trained Chinese BERT model google_zh_model.bin (in UER's format) and put it in the models folder. We load the pre-trained Chinese BERT model and train it on the book review corpus. A pre-training model is composed of an embedding, an encoder, and a target. To build a pre-training model, we should explicitly specify the model's embedding (--embedding), encoder (--encoder and --mask), and target (--target). Suppose we have a machine with 8 GPUs:
```
python3 pretrain.py --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt --pretrained_model_path models/google_zh_model.bin \
                    --output_model_path models/book_review_model.bin --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --total_steps 5000 --save_checkpoint_steps 1000 --embedding word_pos_seg --encoder transformer --mask fully_visible --target bert

mv models/book_review_model.bin-5000 models/book_review_model.bin
```
--mask specifies the attention mask type. BERT uses a bidirectional LM: each token can attend to all tokens, and therefore we use the fully_visible mask type. By default, models/bert/base_config.json is used as the configuration file, which specifies the model hyper-parameters. Notice that the model trained by pretrain.py is saved with a suffix that records the training step. We can remove the suffix for ease of use.
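
If the final step number is not known in advance, the renaming can be scripted. The sketch below assumes checkpoints are named `<output_model_path>-<step>`, as in the example above; `latest_checkpoint` and `strip_step_suffix` are hypothetical helpers, not UER-py functions, and the demo runs against a temporary directory rather than real model files.

```python
import os
import re
import tempfile
from typing import Optional


def latest_checkpoint(output_model_path: str) -> Optional[str]:
    """Return the '<path>-<step>' file with the largest step, or None."""
    directory = os.path.dirname(output_model_path) or "."
    prefix = os.path.basename(output_model_path)
    pattern = re.compile(re.escape(prefix) + r"-(\d+)$")
    best_step, best_path = -1, None
    for name in os.listdir(directory):
        match = pattern.match(name)
        if match and int(match.group(1)) > best_step:
            best_step = int(match.group(1))
            best_path = os.path.join(directory, name)
    return best_path


def strip_step_suffix(output_model_path: str) -> str:
    """Rename the newest checkpoint to the suffix-free path."""
    checkpoint = latest_checkpoint(output_model_path)
    if checkpoint is None:
        raise FileNotFoundError(f"no checkpoint found for {output_model_path}")
    os.replace(checkpoint, output_model_path)
    return output_model_path


# Demo with placeholder files standing in for real checkpoints.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "book_review_model.bin")
    for step in (1000, 5000):
        with open(f"{target}-{step}", "w") as f:
            f.write(str(step))
    strip_step_suffix(target)
    with open(target) as f:
        kept_step = f.read()
    remaining = sorted(os.listdir(tmp))
```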

Then we fine-tune the pre-trained model on the downstream classification dataset. We can use google_zh_model.bin:

python3 run_classifier.py --pretrained_model_path models/google_zh_model.bin --vocab_path models/google_zh_vocab.txt \
                          --train_path datasets/douban_book_review/train.tsv --dev_path datasets/douban_book_review/dev.tsv --test_path datasets/douban_book_review/test.tsv \
                          --epochs_num 3 --batch_size 32 --embedding word_pos_seg --encoder transformer --mask fully_visible
or use our book_review_model.bin, which is the output of pretrain.py:
python3 run_classifier.py --pretrained_model_path models/book_review_model.bin --vocab_path models/google_zh_vocab.txt \
                          --train_path datasets/douban_book_review/train.tsv --dev_path datasets/douban_book_review/dev.tsv --test_path datasets/douban_book_review/test.tsv \
                          --epochs_num 3 --batch_size 32 --embedding word_pos_seg --encoder transformer --mask fully_visible
It turns out that the result of Google's model is 87.5, while the result of book_review_model.bin is 88.2. Note also that we don't need to specify the target in the fine-tuning stage: the pre-training target is replaced with a task-specific target.

The default path of the fine-tuned classifier model is models/finetuned_model.bin . Then we do inference with the fine-tuned model.

python3 inference/run_classifier_infer.py --load_model_path models/finetuned_model.bin --vocab_path models/google_zh_vocab.txt \
                                          --test_path datasets/douban_book_review/test_nolabel.tsv \
                                          --prediction_path datasets/douban_book_review/prediction.tsv --labels_num 2 \
                                          --embedding word_pos_seg --encoder transformer --mask fully_visible
--test_path specifies the path of the file to be predicted.
--prediction_path specifies the path of the file with prediction results.
We need to explicitly specify the number of labels by --labels_num. Douban book review is a two-way classification dataset.

We recommend using CUDA_VISIBLE_DEVICES to specify which GPUs are visible (all GPUs are used by default). Suppose GPU 0 and GPU 2 are available:
```
python3 preprocess.py --corpus_path corpora/book_review_bert.txt --vocab_path models/google_zh_vocab.txt --dataset_path dataset.pt \
                      --processes_num 8 --target bert

CUDA_VISIBLE_DEVICES=0,2 python3 pretrain.py --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt --pretrained_model_path models/google_zh_model.bin \
                                             --output_model_path models/book_review_model.bin --world_size 2 --gpu_ranks 0 1 \
                                             --total_steps 5000 --save_checkpoint_steps 1000 --embedding word_pos_seg --encoder transformer --mask fully_visible --target bert

mv models/book_review_model.bin-5000 models/book_review_model.bin

CUDA_VISIBLE_DEVICES=0,2 python3 run_classifier.py --pretrained_model_path models/book_review_model.bin --vocab_path models/google_zh_vocab.txt \
                                                   --train_path datasets/douban_book_review/train.tsv --dev_path datasets/douban_book_review/dev.tsv --test_path datasets/douban_book_review/test.tsv \
                                                   --output_model_path models/classifier_model.bin \
                                                   --epochs_num 3 --batch_size 32 --embedding word_pos_seg --encoder transformer --mask fully_visible

CUDA_VISIBLE_DEVICES=0,2 python3 inference/run_classifier_infer.py --load_model_path models/classifier_model.bin --vocab_path models/google_zh_vocab.txt \
                                                                   --test_path datasets/douban_book_review/test_nolabel.tsv \
                                                                   --prediction_path datasets/douban_book_review/prediction.tsv --labels_num 2 \
                                                                   --embedding word_pos_seg --encoder transformer --mask fully_visible
```
Notice that we explicitly specify the fine-tuned model path with --output_model_path in the fine-tuning stage. The actual batch size of pre-training is --batch_size times --world_size; the actual batch size of classification is --batch_size.
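
Two pieces of arithmetic in this example are easy to get wrong: which physical GPU each rank lands on, and the effective pre-training batch size. The sketch below works through both. The helper names are hypothetical, and it assumes CUDA_VISIBLE_DEVICES holds plain integer device ids, which CUDA then renumbers from 0 inside the process.

```python
from typing import Dict


def visible_device_map(cuda_visible_devices: str) -> Dict[int, int]:
    """Map the device index a process sees to the physical GPU id."""
    physical = [int(x) for x in cuda_visible_devices.split(",") if x.strip()]
    return {logical: gpu for logical, gpu in enumerate(physical)}


def effective_pretrain_batch_size(batch_size: int, world_size: int) -> int:
    """Pre-training batch size summed over all processes."""
    return batch_size * world_size


# With CUDA_VISIBLE_DEVICES=0,2, --gpu_ranks 0 1 refer to physical GPUs 0 and 2.
mapping = visible_device_map("0,2")

# Illustrative values: --batch_size 64 on 8 GPUs gives 512 sequences per step.
total = effective_pretrain_batch_size(batch_size=64, world_size=8)
```

This is why --gpu_ranks is 0 1 rather than 0 2 in the commands above.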

BERT includes a next sentence prediction (NSP) target. However, the NSP target is not suitable for sentence-level reviews, since we have to split a sentence into multiple parts to construct a document. UER-py facilitates the use of different targets. Using masked language modeling (MLM) as the target could be a more suitable choice for pre-training on reviews:
```
python3 preprocess.py --corpus_path corpora/book_review.txt --vocab_path models/google_zh_vocab.txt --dataset_path dataset.pt \
                      --processes_num 8 --target mlm

python3 pretrain.py --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt --pretrained_model_path models/google_zh_model.bin \
                    --output_model_path models/book_review_mlm_model.bin --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --total_steps 5000 --save_checkpoint_steps 2500 --batch_size 64 --embedding word_pos_seg --encoder transformer --mask fully_visible --target mlm

mv models/book_review_mlm_model.bin-5000 models/book_review_mlm_model.bin

CUDA_VISIBLE_DEVICES=0,1 python3 run_classifier.py --pretrained_model_path models/book_review_mlm_model.bin --vocab_path models/google_zh_vocab.txt \
                                                   --train_path datasets/douban_book_review/train.tsv --dev_path datasets/douban_book_review/dev.tsv --test_path datasets/douban_book_review/test.tsv \
                                                   --epochs_num 3 --batch_size 64 --embedding word_pos_seg --encoder transformer --mask fully_visible
```
It turns out that the result of [*book_review_mlm_model.bin*](https://share.weiyun.com/V0XidqrV) is around 88.5.
Different targets require different corpus formats. The format of the corpus for the MLM target is as follows (one document per line):
```
doc1
doc2
doc3
```
Notice that corpora/book_review.txt (instead of corpora/book_review_bert.txt) is used when the target is switched to MLM.
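
The project already ships both corpus files, but for your own data the two formats are easy to convert between. The sketch below turns a BERT-format corpus (one sentence per line, blank line between documents) into the MLM format (one document per line). `bert_corpus_to_mlm` is a hypothetical helper, and joining sentences with an empty string is an assumption that suits Chinese text; pass `joiner=" "` for space-delimited languages.

```python
def bert_corpus_to_mlm(bert_corpus: str, joiner: str = "") -> str:
    """Collapse each blank-line-separated document onto a single line."""
    documents = []
    current = []
    for line in bert_corpus.splitlines():
        line = line.strip()
        if line:
            current.append(line)  # another sentence of the current document
        elif current:
            documents.append(joiner.join(current))  # blank line ends a document
            current = []
    if current:
        documents.append(joiner.join(current))  # last document has no trailing blank
    return "\n".join(documents) + "\n"


mlm_corpus = bert_corpus_to_mlm("a\nb\n\nc\n\nd\ne\n")
```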

BERT is slow. It would be great if we could speed up the model and still achieve competitive performance. To achieve this goal, we select a 2-layer LSTM encoder to substitute for the 12-layer Transformer encoder. We first download reviews_lstm_lm_model.bin for the 2-layer LSTM encoder. Then we fine-tune it on the downstream classification dataset:
```
python3 run_classifier.py --pretrained_model_path models/reviews_lstm_lm_model.bin --vocab_path models/google_zh_vocab.txt --config_path models/rnn_config.json \
                          --train_path datasets/douban_book_review/train.tsv --dev_path datasets/douban_book_review/dev.tsv --test_path datasets/douban_book_review/test.tsv \
                          --epochs_num 5 --batch_size 64 --learning_rate 1e-3 --embedding word --encoder lstm --pooling mean

python3 inference/run_classifier_infer.py --load_model_path models/finetuned_model.bin --vocab_path models/google_zh_vocab.txt \
                                          --config_path models/rnn_config.json --test_path datasets/douban_book_review/test_nolabel.tsv \
                                          --prediction_path datasets/douban_book_review/prediction.tsv \
                                          --labels_num 2 --embedding word --encoder lstm --pooling mean
```
We can achieve over 85.4 accuracy on the test set, which is a competitive result. Using the same LSTM encoder without pre-training achieves only around 81 accuracy.
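
The --pooling option controls how per-token encoder outputs are reduced to one sentence vector for the classifier. As a rough illustration only (this is plain Python, not UER-py's implementation, and it ignores padding), mean pooling averages each hidden dimension over the tokens; max pooling is shown alongside for comparison.

```python
from typing import List

Vector = List[float]


def mean_pool(token_vectors: List[Vector]) -> Vector:
    """Average each dimension over all token positions."""
    length = len(token_vectors)
    return [sum(column) / length for column in zip(*token_vectors)]


def max_pool(token_vectors: List[Vector]) -> Vector:
    """Take the per-dimension maximum over all token positions."""
    return [max(column) for column in zip(*token_vectors)]


# Three tokens with hidden size 2 (made-up numbers).
encoder_output = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]
sentence_mean = mean_pool(encoder_output)
sentence_max = max_pool(encoder_output)
```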

UER-py also provides many other encoders and corresponding pre-trained models.
The example of pre-training and fine-tuning ELMo on the Chnsenticorp dataset:
```
python3 preprocess.py --corpus_path corpora/chnsenticorp.txt --vocab_path models/google_zh_vocab.txt --dataset_path dataset.pt \
                      --processes_num 8 --seq_length 192 --target bilm

python3 pretrain.py --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt --pretrained_model_path models/mixed_corpus_elmo_model.bin \
                    --config_path models/birnn_config.json \
                    --output_model_path models/chnsenticorp_elmo_model.bin --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --total_steps 5000 --save_checkpoint_steps 2500 --batch_size 64 --learning_rate 5e-4 \
                    --embedding word --encoder bilstm --target bilm

mv models/chnsenticorp_elmo_model.bin-5000 models/chnsenticorp_elmo_model.bin

python3 run_classifier.py --pretrained_model_path models/chnsenticorp_elmo_model.bin --vocab_path models/google_zh_vocab.txt --config_path models/birnn_config.json \
                          --train_path datasets/chnsenticorp/train.tsv --dev_path datasets/chnsenticorp/dev.tsv --test_path datasets/chnsenticorp/test.tsv \
                          --epochs_num 5 --batch_size 64 --seq_length 192 --learning_rate 5e-4 \
                          --embedding word --encoder bilstm --pooling mean
```
Users can download *mixed_corpus_elmo_model.bin* from here.

The example of fine-tuning GatedCNN on the Chnsenticorp dataset:
```
python3 run_classifier.py --pretrained_model_path models/wikizh_gatedcnn_lm_model.bin \
                          --vocab_path models/google_zh_vocab.txt \
                          --config_path models/gatedcnn_9_config.json \
                          --train_path datasets/chnsenticorp/train.tsv --dev_path datasets/chnsenticorp/dev.tsv --test_path datasets/chnsenticorp/test.tsv \
                          --epochs_num 5 --batch_size 64 --learning_rate 5e-5 \
                          --embedding word --encoder gatedcnn --pooling max

python3 inference/run_classifier_infer.py --load_model_path models/finetuned_model.bin --vocab_path models/google_zh_vocab.txt \
                                          --config_path models/gatedcnn_9_config.json \
                                          --test_path datasets/chnsenticorp/test_nolabel.tsv \
                                          --prediction_path datasets/chnsenticorp/prediction.tsv \
                                          --labels_num 2 --embedding word --encoder gatedcnn --pooling max
```
Users can download *wikizh_gatedcnn_lm_model.bin* from here.

UER-py supports cross validation for classification. The example of using cross validation on SMP2020-EWECT, a competition dataset:

CUDA_VISIBLE_DEVICES=0 python3 run_classifier_cv.py --pretrained_model_path models/google_zh_model.bin \
                                                    --vocab_path models/google_zh_vocab.txt \
                                                    --config_path models/bert/base_config.json \
                                                    --output_model_path models/classifier_model.bin \
                                                    --train_features_path datasets/smp2020-ewect/virus/train_features.npy \
                                                    --train_path datasets/smp2020-ewect/virus/train.tsv \
                                                    --epochs_num 3 --batch_size 32 --folds_num 5 \
                                                    --embedding word_pos_seg --encoder transformer --mask fully_visible
The results of google_zh_model.bin are 79.1/63.8 (Accuracy/Macro F1).
--folds_num specifies the number of rounds of cross-validation.
--output_model_path specifies the path of the fine-tuned model. --folds_num models are saved, and the fold ID is appended to each model's name as a suffix.
--train_features_path specifies the path of the out-of-fold (OOF) predictions. run_classifier_cv.py generates probabilities over classes for each fold by training a model on the other folds of the dataset. train_features.npy can be used as features for stacking. More details are given in the Competition solutions section.
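
The out-of-fold idea can be shown without any deep learning. In the sketch below, every example receives class probabilities from a "model" that never saw it during training. The model is a deliberately trivial class-prior estimator, the fold assignment is a simple `i % folds_num` rule, and all function names are hypothetical; how run_classifier_cv.py splits folds and lays out train_features.npy is not specified here.

```python
from collections import Counter
from typing import List, Sequence


def fold_indices(n: int, folds_num: int) -> List[List[int]]:
    """Assign example i to fold i % folds_num (a simple deterministic split)."""
    return [[i for i in range(n) if i % folds_num == k] for k in range(folds_num)]


def fit_prior(labels: Sequence[int], labels_num: int) -> List[float]:
    """Stand-in 'model': class frequencies with add-one smoothing."""
    counts = Counter(labels)
    total = len(labels) + labels_num
    return [(counts[c] + 1) / total for c in range(labels_num)]


def out_of_fold_predictions(labels: Sequence[int], labels_num: int, folds_num: int) -> List[List[float]]:
    """Predict each fold with a model trained only on the other folds."""
    n = len(labels)
    oof = [[0.0] * labels_num for _ in range(n)]
    for held_out in fold_indices(n, folds_num):
        held = set(held_out)
        train_labels = [labels[i] for i in range(n) if i not in held]
        probs = fit_prior(train_labels, labels_num)
        for i in held_out:
            oof[i] = list(probs)  # one probability row per held-out example
    return oof


labels = [0, 1, 1, 0, 1, 1]
oof = out_of_fold_predictions(labels, labels_num=2, folds_num=3)
```

The resulting table has one row per training example and one column per class, which is the shape a second-stage stacking model would typically consume as features.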

We can further try different pre-trained models. For example, we download RoBERTa-wwm-ext-large from HIT and convert it into UER's format:
```
python3 scripts/convert_bert_from_huggingface_to_uer.py --input_model_path models/chinese_roberta_wwm_large_ext_pytorch/pytorch_model.bin \
                                                        --output_model_path models/chinese_roberta_wwm_large_ext_pytorch/pytorch_model_uer.bin \
                                                        --layers_num 24

CUDA_VISIBLE_DEVICES=0,1 python3 run_classifier_cv.py --pretrained_model_path models/chinese_roberta_wwm_large_ext_pytorch/pytorch_model_uer.bin \
                                                      --vocab_path models/google_zh_vocab.txt \
                                                      --config_path models/bert/large_config.json \
                                                      --train_path datasets/smp2020-ewect/virus/train.tsv \
                                                      --train_features_path datasets/smp2020-ewect/virus/train_features.npy \
                                                      --epochs_num 3 --batch_size 64 --folds_num 5 \
                                                      --embedding word_pos_seg --encoder transformer --mask fully_visible
```
The results of *RoBERTa-wwm-ext-large* are 80.3/66.8 (Accuracy/Macro F1).
The example of using our pre-trained model [*Reviews+BertEncoder(large)+MlmTarget*](https://share.weiyun.com/hn7kp9bs) (see model zoo for more details):
```
CUDA_VISIBLE_DEVICES=0,1 python3 run_classifier_cv.py --pretrained_model_path models/reviews_bert_large_mlm_model.bin \
                                                      --vocab_path models/google_zh_vocab.txt \
                                                      --config_path models/bert/large_config.json \
                                                      --train_path datasets/smp2020-ewect/virus/train.tsv \
                                                      --train_features_path datasets/smp2020-ewect/virus/train_features.npy \
                                                      --folds_num 5 --epochs_num 3 --batch_size 64 --seed 17 \
                                                      --embedding word_pos_seg --encoder transformer --mask fully_visible
```
The results are 81.3/68.4 (Accuracy/Macro F1), which is very competitive compared with other open-source pre-trained weights. The corpus used by the above pre-trained weights is highly similar to SMP2020-EWECT, a Weibo review dataset.
Sometimes a large model does not converge. In that case we need to try different random seeds by specifying --seed.

Besides classification, UER-py also provides scripts for other downstream tasks. We could use run_ner.py for named entity recognition:

python3 run_ner.py --pretrained_model_path models/google_zh_model.bin --vocab_path models/google_zh_vocab.txt \
                   --train_path datasets/msra_ner/train.tsv --dev_path datasets/msra_ner/dev.tsv --test_path datasets/msra_ner/test.tsv \
                   --output_model_path models/ner_model.bin \
                   --label2id_path datasets/msra_ner/label2id.json --epochs_num 5 --batch_size 16 \
                   --embedding word_pos_seg --encoder transformer --mask fully_visible
--label2id_path specifies the path of label2id file for named entity recognition. Then we do inference with the fine-tuned ner model:
python3 inference/run_ner_infer.py --load_model_path models/ner_model.bin --vocab_path models/google_zh_vocab.txt \
                                   --test_path datasets/msra_ner/test_nolabel.tsv \
                                   --prediction_path datasets/msra_ner/prediction.tsv \
                                   --label2id_path datasets/msra_ner/label2id.json \
                                   --embedding word_pos_seg --encoder transformer --mask fully_visible

We could use run_cmrc.py for machine reading comprehension:

python3 run_cmrc.py --pretrained_model_path models/google_zh_model.bin --vocab_path models/google_zh_vocab.txt \
                    --train_path datasets/cmrc2018/train.json --dev_path datasets/cmrc2018/dev.json \
                    --output_model_path models/cmrc_model.bin \
                    --epochs_num 2 --batch_size 8 --seq_length 512 \
                    --embedding word_pos_seg --encoder transformer --mask fully_visible
We don't specify the --test_path because CMRC2018 dataset doesn't provide labels for testset. Then we do inference with the fine-tuned cmrc model:
python3 inference/run_cmrc_infer.py --load_model_path models/cmrc_model.bin --vocab_path models/google_zh_vocab.txt \
                                    --test_path datasets/cmrc2018/test.json  \
                                    --prediction_path datasets/cmrc2018/prediction.json --seq_length 512 \
                                    --embedding word_pos_seg --encoder transformer --mask fully_visible


Datasets

We collected a range of downstream datasets and converted them into a format that UER can load directly.


Modelzoo

With the help of UER, we pre-trained models with different corpora, encoders, and targets. All pre-trained models can be loaded by UER directly. More pre-trained models will be released in the future. A detailed introduction to the pre-trained models and their download links can be found in the modelzoo.


Instructions

UER-py's framework

UER-py is organized as follows:
```
UER-py/
    |--uer/
    |    |--encoders/: contains encoders such as RNN, CNN, BERT
    |    |--targets/: contains targets such as language modeling, masked language modeling
    |    |--layers/: contains frequently-used NN layers, such as embedding layer, normalization layer
    |    |--models/: contains model.py, which combines embedding, encoder, and target modules
    |    |--utils/: contains frequently-used utilities
    |    |--model_builder.py
    |    |--model_loader.py
    |    |--model_saver.py
    |    |--trainer.py
    |
    |--corpora/: contains corpora for pre-training
    |--datasets/: contains downstream tasks
    |--models/: contains pre-trained models, vocabularies, and configuration files
    |--scripts/: contains useful scripts for pre-training models
    |--inference/: contains inference scripts for downstream tasks
    |
    |--preprocess.py
    |--pretrain.py
    |--run_classifier.py
    |--run_classifier_cv.py
    |--run_classifier_mt.py
    |--run_cmrc.py
    |--run_ner.py
    |--run_dbqa.py
    |--run_c3.py
    |--run_chid.py
    |--README.md
    |--README_ZH.md
```

The code is well-organized. Users can use and extend it with little effort.

More examples of using UER can be found in the instructions, which help users quickly implement pre-training models such as BERT, GPT, ELMo, and T5 and fine-tune pre-trained models on a range of downstream tasks.


Competition solutions

UER-py has been used in winning solutions of many NLP competitions. In this section, we provide some examples of using UER-py to achieve SOTA results in NLP competitions, such as CLUE. See competition solutions for more detailed information.


Citation

If you are using the work (e.g. pre-trained model) in UER-py for academic work, please cite the system paper published in EMNLP 2019:

```
@article{zhao2019uer,
  title={UER: An Open-Source Toolkit for Pre-training Models},
  author={Zhao, Zhe and Chen, Hui and Zhang, Jinbin and Zhao, Xin and Liu, Tao and Lu, Wei and Chen, Xi and Deng, Haotang and Ju, Qi and Du, Xiaoyong},
  journal={EMNLP-IJCNLP 2019},
  pages={241},
  year={2019}
}
```


Contact information

For communication related to this project, please contact Zhe Zhao ([email protected]; [email protected]) or Yudong Li ([email protected]) or Xin Zhao ([email protected]).

This work is guided by my enterprise mentors Qi Ju, Xuefeng Yang, Haotang Deng and school mentors Tao Liu, Xiaoyong Du.

We also got a lot of help from my Tencent colleagues Hui Chen, Jinbin Zhang, Zhiruo Wang, Weijie Liu, Peng Zhou, Haixiao Liu, and Weijian Wu.
