Pretrain and finetune ELECTRA with fastai and huggingface. (Results of the paper replicated!)
Unofficial PyTorch implementation of
ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators by Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning.
I pretrained ELECTRA-small from scratch and successfully replicated the paper's results on GLUE.
|Model|CoLA|SST|MRPC|STS|QQP|MNLI|QNLI|RTE|Avg. of Avg.|
|---|---|---|---|---|---|---|---|---|---|
|ELECTRA-Small-OWT|56.8|88.3|87.4|86.8|88.3|78.9|87.9|68.5|80.36|
|ELECTRA-Small-OWT (my)|58.72|88.03|86.04|86.16|88.63|80.4|87.45|67.46|80.36|
Table 1: Results on the GLUE dev set. The official result comes from the expected results; its scores are the average scores finetuned from the same checkpoint (see this issue). My result comes from pretraining a model from scratch and then taking the average of 10 finetuning runs for each task. Both models are trained on the OpenWebText corpus.
|Model|CoLA|SST|MRPC|STS|QQP|MNLI|QNLI|RTE|Avg.|
|---|---|---|---|---|---|---|---|---|---|
|ELECTRA-Small++|55.6|91.1|84.9|84.6|88.0|81.6|88.3|63.6|79.7|
|ELECTRA-Small++ (my)|54.8|91.6|84.6|84.2|88.5|82.0|89.0|64.7|79.92|
Table 2: Results on the GLUE test set. My results come from finetuning the pretrained checkpoint loaded from huggingface.
[Figure: official training loss curve vs. my training loss curve]
Table 3: Both are small models trained on OpenWebText. The official curve is from here. Take the training loss values with a grain of salt, since they don't reflect performance on downstream tasks.
- You don't need to download and process datasets manually; all you need to do is run the training script, which takes care of that for you automatically, thanks to huggingface/datasets and huggingface/transformers (a sketch follows this list).
- AFAIK, the closest reimplementation to the original one, taking care of many easily overlooked details (described below).
- AFAIK, the only reimplementation that has validated itself by replicating the results in the paper.
- Comes with Jupyter notebooks, in which you can explore the code and inspect the processed data.
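As a hedged sketch of that automatic handling, using huggingface/datasets directly (the scripts' actual loading code may differ):

```python
# Hedged sketch of automatic dataset download/caching via huggingface/datasets.
from datasets import load_dataset

glue_mrpc = load_dataset('glue', 'mrpc')  # downloads and caches on first use
print(glue_mrpc['train'][0])

# The pretraining corpus works the same way (note: very large download):
# owt = load_dataset('openwebtext', split='train')
```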
|Mean|Std|Max|Min|#models|
|---|---|---|---|---|
|81.38|0.57|82.23|80.42|14|

Table 4: Statistics of GLUE dev-set results for small models. Every model is pretrained from scratch with a different seed and finetuned for 10 random runs for each GLUE task. A model's score is the average over tasks of the best of its 10 runs, the same process as described in the paper (a sketch follows this table). As we can see, although ELECTRA's training mimics adversarial training, it has good training stability.
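For concreteness, a small sketch of that scoring protocol with hypothetical numbers (the real evaluation uses 10 finetuning scores per task across all GLUE tasks):

```python
import statistics

# Hypothetical finetuning scores for one pretrained model: task -> 10 run scores.
runs = {
    'CoLA': [54.2, 56.1, 58.7, 55.0, 57.3, 53.9, 56.8, 55.5, 58.1, 54.7],
    'RTE':  [65.0, 67.5, 66.1, 64.3, 67.0, 65.8, 66.7, 64.9, 67.2, 65.4],
}

# A model's score = average over tasks of the best of its 10 runs.
model_score = statistics.mean(max(scores) for scores in runs.values())
print(round(model_score, 2))  # (58.7 + 67.5) / 2 = 63.1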
|Model|CoLA|SST|MRPC|STS|QQP|MNLI|QNLI|RTE|
|---|---|---|---|---|---|---|---|---|
|ELECTRA-Small-OWT (my)|1.30|0.49|0.7|0.29|0.1|0.15|0.33|1.93|
Table 5: Standard deviation of scores for each task. This is the same model as in Table 1, finetuned for 10 runs per task.
- HuggingFace forum post
- Fastai forum post
Note: This project is actually for my personal research, so I didn't try to make it easy to use for everyone; instead, I tried to make it easy to read and modify.
1. `pip3 install -r requirements.txt`
2. `python pretrain.py`
3. Set `pretrained_checkpoint` in `finetune.py` to use the checkpoint you've pretrained and saved in `electra_pytorch/checkpoints/pretrain` (a sketch of these edits follows this list).
4. `python finetune.py` (with `do_finetune` set to `True`)
5. Pick the best of the 10 runs for each task (e.g. on Neptune; see below) and set `th_runs` in `finetune.py` according to the numbers in the names of the runs you picked.
6. `python finetune.py` (with `do_finetune` set to `False`); this outputs predictions on the test set, so you can then compress and send the `.tsv`s in `electra_pytorch/test_outputs//*.tsv` to the GLUE site to get your test score.
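A minimal sketch of the edits steps 3-6 refer to. Only `pretrained_checkpoint`, `do_finetune`, and `th_runs` are names taken from `finetune.py`; the values and layout below are illustrative assumptions, not the file's actual contents.

```python
# Hedged sketch of the options edited in finetune.py across steps 3-6.
# All values are illustrative assumptions.

# Step 3: point to a checkpoint you saved under electra_pytorch/checkpoints/pretrain
pretrained_checkpoint = 'checkpoints/pretrain/my_run.pth'  # hypothetical filename

# Step 4: run finetuning (10 runs per GLUE task)
do_finetune = True

# Step 5: after inspecting the runs, record the number of the best run per task
th_runs = {'cola': 3, 'rte': 7}  # hypothetical picks, one entry per task

# Step 6: rerun with finetuning disabled to output test-set predictions
do_finetune = False
```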
I didn't use CLI arguments, so edit the options enclosed within `MyConfig` in the Python files to your needs before running them. (There are comments below it showing the options for the vanilla settings.) A sketch of this config style follows.
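For illustration, a self-contained sketch of the attribute-style config pattern; the actual `MyConfig` class and its fields in this repo may differ, and the values here are assumptions.

```python
# Minimal sketch of an attribute-accessible dict config, in the spirit of the
# `MyConfig` blocks in pretrain.py / finetune.py. Fields and values are
# illustrative assumptions, not the repo's actual options.
class MyConfig(dict):
    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)

c = MyConfig(
    base_run_name='vanilla',  # hypothetical: run name used for checkpoints
    seed=11081,               # hypothetical: random seed
    size='small',             # hypothetical: 'small' / 'base' / 'large'
)

print(c.size)  # attribute-style access -> 'small'
```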
You will need a Neptune account and a Neptune project created on their website to record GLUE finetuning results. Don't forget to replace `richarddwang/electra-glue` with your own project's name, as in the sketch below.
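A sketch of what that might look like, assuming the 2020-era (legacy) neptune-client API that matches this project's vintage; the experiment and metric names are illustrative.

```python
# Hedged sketch using the legacy neptune-client API (circa 2020).
import neptune

# Replace the author's project with your own '<username>/<project>' name.
neptune.init(project_qualified_name='your-username/your-project')
neptune.create_experiment(name='glue-finetune')  # illustrative experiment name
neptune.log_metric('dev_score', 80.36)           # illustrative metric
neptune.stop()
```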
The Python files `pretrain.py` and `finetune.py` are in fact converted from `Pretrain.ipynb` and `Finetune_GLUE.ipynb`. You can also use those notebooks to explore ELECTRA training and finetuning.
Below are the details of the original implementation/paper that are easy to overlook and that I have taken care of. I found these details indispensable for successfully replicating the results of the paper.
`ElectraClassificationHead` uses.
If you pretrain, finetune, and generate test results, `electra_pytorch` will generate these files for you:
```
project root
|
|── datasets
|   |── glue
|       |──
|       ...
|
|── checkpoints
|   |── pretrain
|   |   |── __.pth
|   |   ...
|   |
|   |── glue
|       |── __.pth
|       ...
|
|── test_outputs
    |──
    |   |── CoLA.tsv
    |   ...
    |
    ...
```
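If you want to poke at a saved pretraining checkpoint directly, here is a hedged sketch; the filename is hypothetical, and whether the `.pth` holds a state dict or something else depends on how `pretrain.py` saved it, so verify before relying on it.

```python
# Hedged sketch: inspect a checkpoint saved under checkpoints/pretrain.
# The filename is a hypothetical example; check what the .pth actually contains.
import torch

state = torch.load('checkpoints/pretrain/vanilla_11081.pth', map_location='cpu')
print(type(state))          # e.g. a dict/OrderedDict of parameter tensors
if isinstance(state, dict):
    print(list(state)[:5])  # peek at the first few keys
```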
```
@inproceedings{clark2020electra,
  title     = {{ELECTRA}: Pre-training Text Encoders as Discriminators Rather Than Generators},
  author    = {Kevin Clark and Minh-Thang Luong and Quoc V. Le and Christopher D. Manning},
  booktitle = {ICLR},
  year      = {2020},
  url       = {https://openreview.net/pdf?id=r1xMH1BtvB}
}
```
I will join RC2020, so there may be another paper for this implementation then. Be sure to check here again when you cite this implementation.
```
@misc{electra_pytorch,
  author       = {Richard Wang},
  title        = {PyTorch implementation of ELECTRA},
  year         = {2020},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/richarddwang/electra_pytorch}}
}
```