Multi-GPU training on one machine for BERT from scratch, without Horovod
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
More GPUs means more data in each batch, and the gradients over a batch are averaged for back-propagation. With the same nominal learning rate, a larger batch therefore means each example contributes less to each update, i.e. a lower effective learning rate per example. A lower effective learning rate results in better pre-training performance.
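A minimal sketch of the per-GPU gradient averaging this relies on, assuming a TensorFlow 1.x-style tower setup (the function name average_gradients and the surrounding comments are illustrative, not taken from this repository):

```python
import tensorflow as tf  # TF 1.x-style graph code

def average_gradients(tower_grads):
    """Average gradients computed on each GPU tower.

    tower_grads: one list per GPU of (gradient, variable) pairs,
    as returned by optimizer.compute_gradients().
    """
    averaged = []
    for grad_and_vars in zip(*tower_grads):
        # grad_and_vars is ((grad_gpu0, var), (grad_gpu1, var), ...)
        grads = [tf.expand_dims(g, 0) for g, _ in grad_and_vars]
        grad = tf.reduce_mean(tf.concat(grads, axis=0), axis=0)
        averaged.append((grad, grad_and_vars[0][1]))
    return averaged

# Each GPU sees `batchsize` examples, so the global batch is
# batchsize * num_gpus; averaging over it shrinks each example's
# contribution to the update (the lower effective learning rate above).
```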
0, edit the input and output file names in
batchsize is the per-GPU batch size, not the global batch size
sample_text.txt: each sentence ends with \n, and paragraphs are separated by an empty line.
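For reference, a file in the sample_text.txt format could be produced like this (the file name and sentences are only illustrative):

```python
# One sentence per line; documents (paragraphs) separated by a blank line.
docs = [
    ["This is the first sentence of document one.",
     "This is its second sentence."],
    ["Document two starts here.",
     "It also has a second sentence."],
]
with open("my_corpus.txt", "w", encoding="utf-8") as f:
    for i, doc in enumerate(docs):
        for sentence in doc:
            f.write(sentence + "\n")   # each sentence ends with \n
        if i < len(docs) - 1:
            f.write("\n")              # empty line between documents
```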
Quora Question Pairs (English) dataset:
Official BERT: ACC 91.2, AUC 96.9
This BERT (pre-training loss 2.05): ACC 90.1, AUC 96.3