# PyTorch implementation of LSTM-CRF for named entity recognition
This repository implements an LSTM-CRF model for named entity recognition. The model is the same as the one in Lample et al. (2016), except that we do not have the last `tanh` layer after the BiLSTM. We achieve SOTA performance on both the CoNLL-2003 and OntoNotes 5.0 English datasets (check our benchmark with GloVe and ELMo, and the benchmark results with fine-tuning BERT below).
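For orientation, here is a minimal sketch of the encoder described above, a BiLSTM whose outputs are projected directly to per-label emission scores without an extra `tanh` layer, before a CRF layer scores whole tag sequences. This is an illustration with made-up names, not the repository's actual module:

```python
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    """Illustrative sketch: BiLSTM -> linear emission scores (no tanh)."""
    def __init__(self, embed_dim, hidden_dim, num_labels):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden_dim // 2,
                            batch_first=True, bidirectional=True)
        # Direct linear projection to the label space; no tanh layer in between.
        self.hidden2tag = nn.Linear(hidden_dim, num_labels)

    def forward(self, word_embeddings):
        # word_embeddings: (batch, seq_len, embed_dim)
        lstm_out, _ = self.lstm(word_embeddings)
        return self.hidden2tag(lstm_out)  # emission scores consumed by the CRF layer
```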
## Announcement: Benchmark results by fine-tuning BERT/RoBERTa
| Model | Dataset | Precision | Recall | F1 |
|-------|---------|:---------:|:------:|:--:|
| BERT-base-cased + CRF (this repo) | CoNLL-2003 | 91.69 | 92.05 | 91.87 |
| Roberta-base + CRF (this repo) | CoNLL-2003 | 91.88 | 93.01 | 92.44 |
| BERT-base-cased + CRF (this repo) | OntoNotes 5 | 89.57 | 89.45 | 89.51 |
| Roberta-base + CRF (this repo) | OntoNotes 5 | 90.12 | 91.25 | 90.68 |
## More details
**Update**: Our latest breaking change uses a data loader to read all the data and convert it into tensors. Our latest release also uses HuggingFace's `transformers`, but it has not yet adopted the PyTorch `Dataset` and `DataLoader`. This version uses both, and we are also testing the correctness of the code before publishing a new release.
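As a rough illustration of that change, the sketch below shows one way to wrap tokenized sentences and tags into a PyTorch `Dataset` with a padding `collate_fn`. The class and field names are hypothetical and this is not the repo's actual implementation:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class NERDataset(Dataset):
    """Hypothetical example: `instances` is a list of (words, tags) pairs."""
    def __init__(self, instances, word2idx, label2idx, unk="<unk>"):
        self.data = [
            ([word2idx.get(w, word2idx[unk]) for w in words],
             [label2idx[t] for t in tags])
            for words, tags in instances
        ]

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

def pad_collate(batch, pad_word=0, pad_label=0):
    """Pad every sentence in the batch to the batch's maximum length."""
    max_len = max(len(words) for words, _ in batch)
    word_ids, label_ids, mask = [], [], []
    for words, labels in batch:
        pad = max_len - len(words)
        word_ids.append(words + [pad_word] * pad)
        label_ids.append(labels + [pad_label] * pad)
        mask.append([1] * len(words) + [0] * pad)
    return torch.tensor(word_ids), torch.tensor(label_ids), torch.tensor(mask)

# loader = DataLoader(NERDataset(train_insts, word2idx, label2idx),
#                     batch_size=32, shuffle=True, collate_fn=pad_collate)
```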
### Requirements

If you use `conda`:

```bash
git clone https://github.com/allanj/pytorch_lstmcrf.git

# Python > 3.6 is required
conda create -n pt_lstmcrf python=3.6
conda activate pt_lstmcrf
# kindly check https://pytorch.org for the version suitable for your machine
conda install pytorch torchvision cudatoolkit=10.0 -c pytorch -n pt_lstmcrf
pip install tqdm
pip install termcolor
pip install overrides
pip install allennlp   ## required when we need to get the ELMo vectors
pip install transformers
```
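Optionally, you can sanity-check the environment afterwards (this step is not part of the original instructions, just a quick verification):

```python
# Confirm that PyTorch and transformers are importable and whether CUDA is visible.
import torch
import transformers

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
```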
In the documentation below, we present two ways for users to run the code:

1. Run the model via (fine-tuning) BERT/RoBERTa/etc. from the HuggingFace `transformers` package.
2. Run the model with simple word embeddings (and static ELMo/BERT representations loaded from external vectors).

Our default argument setup refers to the first one.
### Usage with fine-tuning BERT/RoBERTa

1. Set the `embedder_type` argument to the model name in HuggingFace. For example, if we are using `bert-base-cased`, we just need to change the embedder type to `bert-base-cased`.

   ```bash
   python transformers_trainer.py --device=cuda:0 --dataset=YourData --model_folder=saved_models --embedder_type=bert-base-cased
   ```
2. Check whether your embedder type is listed in `config/transformers_util.py`. If not, add it to the utils. For example, if you would like to use BERT-Large, add the following line to the dictionary:

   ```python
   'bert-large-cased': {"model": BertModel, "tokenizer": BertTokenizer}
   ```

   The name `bert-large-cased` has to follow the naming rule by HuggingFace.
3. Run with the corresponding `embedder_type`:

   ```bash
   python trainer.py --embedder_type=bert-large-cased
   ```

   The default value for `embedder_type` is `normal`, which refers to the classic LSTM-CRF and can be used with `static_context_emb` as in the previous section. Changing the name to something like `bert-base-cased` or `roberta-base` makes us load the model directly from HuggingFace. Note: if you use other models, remember to replace the tokenization mechanism in `config/utils.py` (see the alignment sketch after this list).
4. The transformer embedder implementation lives in `config/transformers_util.py` and `model/embedder/transformers_embedder.py`. If you prefer not to fine-tune (i.e., freeze) the transformer weights, go to `model/transformers_embedder.py` and uncomment the following:

   ```python
   self.model.requires_grad = False
   ```
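The note in step 3 about the tokenization mechanism boils down to aligning word-level labels with subword pieces. Below is a minimal, hypothetical sketch of that alignment using a fast HuggingFace tokenizer; it is not the code in `config/utils.py`, and the variable names are illustrative:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=True)

words = ["John", "lives", "in", "Kuala", "Lumpur"]
encoding = tokenizer(words, is_split_into_words=True)

# word_ids(): None for special tokens, repeated word index for subword pieces.
# Map each original word to its first subword piece so that one NER label per
# word can be aligned with the transformer's subword-level output.
word_ids = encoding.word_ids()
first_subword_index = [word_ids.index(i) for i in range(len(words))]

print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
print(first_subword_index)
```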
### Other usages

Using word embeddings or external contextualized embeddings (ELMo/BERT) can be found here.
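For the word-embedding path, the idea is simply to initialize the embedding layer from pretrained vectors. The sketch below is a generic illustration (file path, function, and variable names are made up; it is not the repo's loader):

```python
import numpy as np
import torch
import torch.nn as nn

def load_glove(path, word2idx, dim=100):
    """Fill an embedding matrix with GloVe vectors for known words;
    unknown words keep a small random initialization."""
    emb = np.random.uniform(-0.25, 0.25, (len(word2idx), dim)).astype("float32")
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if parts[0] in word2idx and len(parts) == dim + 1:
                emb[word2idx[parts[0]]] = np.asarray(parts[1:], dtype="float32")
    return torch.from_numpy(emb)

# word2idx would come from the training vocabulary, e.g.:
# embedding = nn.Embedding.from_pretrained(
#     load_glove("glove.6B.100d.txt", word2idx), freeze=False)
```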
### Training with your own data

1. Create a folder `YourData` under the `data` directory.
2. Put the `train.txt`, `dev.txt` and `test.txt` files (make sure the format is compatible, i.e., the first column is words and the last column is tags) under this directory; a small example is shown after this list. If you have a different format, simply modify the reader in `config/reader.py`.
3. Change the `dataset` argument to `YourData` when you run `trainer.py`.
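For reference, here is a small made-up example in the usual CoNLL-style layout (word in the first column, tag in the last, sentences separated by a blank line); check `config/reader.py` in case your files need additional columns or a different separator:

```
John   B-PER
lives  O
in     O
New    B-LOC
York   I-LOC
.      O

She    O
works  O
there  O
.      O
```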
A huge thanks to @yuchenlin for his contribution to this repo.