Transformer language model (GPT-2) with sentencepiece tokenizer
===============================================================

.. image:: https://img.shields.io/travis/lopuhin/transformer-lm/master.svg
    :target: https://travis-ci.org/lopuhin/transformer-lm
    :alt: Build Status
Training GPT-2 transformer language model on your own corpora with
sentencepiece tokenization.
This repo contains a PyTorch implementation of GPT-2, which supports multi-GPU
training. It also contains a TensorFlow implementation in ``lm/gpt_2_tf``, but
it is not developed any more. They share the same data preparation scripts.
The TF training command is ``gpt-2-tf-train`` and needs TensorFlow 1.13.
Documentation below is for the PyTorch version.
Installation
------------

Python 3.6+ is required with torch nightly or 1.6.0+. Working in a virtualenv
is assumed below. Install__ the appropriate version of PyTorch first, and then::

    pip install -r requirements.txt
    python setup.py develop

__ https://pytorch.org/get-started/locally/
Usage
-----

Instructions are below. See also ``tests/test_shakespeare.sh`` for a complete
pipeline demo on a small corpus (takes a minute on a CPU).
Prepare data for training
+++++++++++++++++++++++++
Corpus format: a directory with top-level ``train``, ``valid`` and ``test``
folders. Each top-level folder may contain sub-folders. Inside them, there
must be utf-8 encoded text files with a ``.txt`` extension.
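For illustration, a minimal script that lays out a tiny corpus in this format
(the folder and file names here are only examples, not required by the tools):

```python
from pathlib import Path

# Create a tiny example corpus with the expected layout:
# top-level train/valid/test folders containing utf-8 .txt files.
samples = {
    "train": "First example document.",
    "valid": "Held-out validation text.",
    "test": "Held-out test text.",
}
root = Path("data/corpora-demo")
for split, text in samples.items():
    split_dir = root / split
    split_dir.mkdir(parents=True, exist_ok=True)
    (split_dir / "example.txt").write_text(text, encoding="utf-8")

print(sorted(p.name for p in root.iterdir()))  # ['test', 'train', 'valid']
```

Sub-folders under each split are also allowed, as noted above.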
The commands to train the sentencepiece model and encode the corpus support
multiple corpora; in the examples below we assume they can be listed as
``data/corpora-*``.
Train the sentencepiece model (``sp-text.txt`` can be removed after running).
This can consume a large amount of memory; adjust sentencepiece arguments as
advised if needed (this is not supported in the ``sp-train`` wrapper)::

    sp-train data/corpora-* sp-text.txt sp-model
Encode the corpora, producing numpy files::

    sp-encode data/corpora-* sp-model.model data/encoded
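Conceptually, encoding turns each split's text into a stream of integer token
ids. The toy sketch below uses a whitespace vocabulary purely for illustration;
the real pipeline uses the trained sentencepiece model and writes numpy files:

```python
# Toy sketch of what corpus encoding produces conceptually:
# text -> integer token ids (illustrative only; not sp-encode's logic).
def build_vocab(texts):
    tokens = sorted({tok for text in texts for tok in text.split()})
    return {tok: i for i, tok in enumerate(tokens)}

def encode(text, vocab):
    return [vocab[tok] for tok in text.split()]

corpus = ["to be or not to be"]
vocab = build_vocab(corpus)
ids = encode(corpus[0], vocab)
print(ids)  # [3, 0, 2, 1, 3, 0]
```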
Training
++++++++

Example command::

    gpt-2 run-root data/encoded sp-model.model
``run-root`` would contain model checkpoints and json-lines logs, which can be
plotted in a jupyter notebook with ``json_log_plots.plot("run-root")``, with
the number of tokens seen on the X axis.
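For a quick look without a notebook, a json-lines log can be read with the
standard library. The field names below ("step", "loss") are illustrative
assumptions, not the exact keys the trainer writes:

```python
import io
import json

# Sketch: parsing a json-lines log, one JSON object per line.
# Keys "step" and "loss" are assumed for illustration only.
log_text = '{"step": 1, "loss": 4.2}\n{"step": 2, "loss": 3.9}\n'
records = [json.loads(line) for line in io.StringIO(log_text)]
losses = [r["loss"] for r in records]
print(losses)  # [4.2, 3.9]
```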
Default hyperparameters correspond to the released "small" GPT-2 model.
When multiple GPUs are available, they would be used for training with the
help of ``torch.distributed``.
If the path exists and the ``--clean`` key is NOT passed, training would be
resumed. Note that all parameters still need to be specified, and model
parameters need to match.
Notes on training parameters:

- ``--batch-size`` is per-GPU, so you don't need to re-tune it when changing
  the number of GPUs; just use the max that fits into memory.
- ``--g-accum-gradients`` is the global number of gradient accumulations; it
  must be divisible by the number of GPUs. Effective global batch size is
  always ``batch_size * g_accum_gradients``.
- ``--lr`` does not need to be changed when changing ``--g-accum-gradients``,
  the number of GPUs, or ``--n-ctx``: loss is already scaled appropriately.
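The batch-size arithmetic above can be checked with a few lines (the numbers
are arbitrary examples, not recommended settings):

```python
# Effective global batch size per the notes above: per-GPU
# --batch-size times --g-accum-gradients (example values only).
batch_size = 2          # per-GPU --batch-size
g_accum_gradients = 8   # global number of gradient accumulations
n_gpus = 4              # g_accum_gradients must be divisible by this

assert g_accum_gradients % n_gpus == 0
per_gpu_accum = g_accum_gradients // n_gpus   # accumulation steps per GPU
effective_batch = batch_size * g_accum_gradients
print(effective_batch)  # 16
```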
Inference
+++++++++

Example command::

    gpt-2-gen run-root "Artificial intelligence"

- ``run-root`` would contain model checkpoints
- ``"Artificial intelligence"`` is the text prefix used as a starting point
  for generating tokens
Notes on inference parameters:

- ``--tokens-to-generate``: number of tokens to generate, default is 42
- ``--top-k``: number of token candidates to generate for each position
  (beam width), default is 8.
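As a generic sketch of top-k sampling (not this repo's exact implementation):
keep only the k highest-scoring candidates, renormalize their probabilities,
and sample among them.

```python
import math
import random

# Generic top-k sampling sketch: restrict sampling to the k
# highest-logit candidates, softmax-renormalize, then sample.
def top_k_sample(logits, k, rng):
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    weights = [math.exp(logits[i] - max(logits)) for i in top]
    total = sum(weights)
    probs = [w / total for w in weights]
    return rng.choices(top, weights=probs, k=1)[0]

rng = random.Random(0)
logits = [0.1, 2.0, -1.0, 1.5, 0.3]
token = top_k_sample(logits, k=2, rng=rng)
print(token)  # one of the two highest-logit indices: 1 or 3
```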
License & credits
-----------------

License is MIT.

The TensorFlow GPT-2 model is taken from
https://github.com/openai/gpt-2/blob/master/src/model.py and the TensorFlow
GPT-2 training code is based on
https://github.com/nshepperd/gpt-2/blob/finetuning/train.py.

The PyTorch port is based on the original OpenAI code.
The test Shakespeare corpus under ``tests/shakespeare`` is from
http://shakespeare.mit.edu and is in the public domain.
See also the OpenAI GPT-2 paper and blog post.