LUKE



LUKE (Language Understanding with Knowledge-based Embeddings) is a new pre-trained contextualized representation of words and entities based on the transformer architecture. It achieves state-of-the-art results on important NLP benchmarks including SQuAD v1.1 (extractive question answering), CoNLL-2003 (named entity recognition), ReCoRD (cloze-style question answering), TACRED (relation classification), and Open Entity (entity typing).

This repository contains the source code to pre-train the model and fine-tune it to solve downstream tasks.
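To illustrate what these word and entity representations look like in practice, below is a minimal sketch that assumes the Hugging Face Transformers integration of LUKE is available; the `transformers` dependency, the `studio-ousia/luke-base` model name, and the example sentence are assumptions and are not part of this repository's Poetry environment.

```python
# Minimal sketch (assumption: Hugging Face Transformers with LUKE support is installed).
from transformers import LukeModel, LukeTokenizer

tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-base")
model = LukeModel.from_pretrained("studio-ousia/luke-base")

text = "Beyoncé lives in Los Angeles."
entity_spans = [(0, 7), (17, 28)]  # character spans of "Beyoncé" and "Los Angeles"

inputs = tokenizer(text, entity_spans=entity_spans, return_tensors="pt")
outputs = model(**inputs)

word_repr = outputs.last_hidden_state            # one vector per word-piece token
entity_repr = outputs.entity_last_hidden_state   # one vector per input entity
```

The entity representations are what the downstream tasks described below build on.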

News

November 5, 2021: LUKE-500K (base) model

We released LUKE-500K (base), a new pretrained LUKE model that is smaller than the existing LUKE-500K (large). The experimental results of LUKE-500K (base) and LUKE-500K (large) on SQuAD v1.1 and CoNLL-2003 are as follows:

| Task                          | Dataset    | Metric | LUKE-500K (base) | LUKE-500K (large) |
| ----------------------------- | ---------- | ------ | ---------------- | ----------------- |
| Extractive Question Answering | SQuAD v1.1 | EM/F1  | 86.1/92.3        | 90.2/95.4         |
| Named Entity Recognition      | CoNLL-2003 | F1     | 93.3             | 94.3              |

In the LUKE-500K (base) experiments, we tuned only the batch size and the learning rate.

Comparison with State-of-the-Art

LUKE outperforms the previous state-of-the-art methods on five important NLP tasks:

| Task                           | Dataset     | Metric | LUKE-500K (large) | Previous SOTA                 |
| ------------------------------ | ----------- | ------ | ----------------- | ----------------------------- |
| Extractive Question Answering  | SQuAD v1.1  | EM/F1  | 90.2/95.4         | 89.9/95.1 (Yang et al., 2019) |
| Named Entity Recognition       | CoNLL-2003  | F1     | 94.3              | 93.5 (Baevski et al., 2019)   |
| Cloze-style Question Answering | ReCoRD      | EM/F1  | 90.6/91.2         | 83.1/83.7 (Li et al., 2019)   |
| Relation Classification        | TACRED      | F1     | 72.7              | 72.0 (Wang et al., 2020)      |
| Fine-grained Entity Typing     | Open Entity | F1     | 78.2              | 77.6 (Wang et al., 2020)      |

These numbers are reported in our EMNLP 2020 paper.

Installation

LUKE can be installed using Poetry:

$ poetry install

The virtual environment automatically created by Poetry can be activated with:

$ poetry shell
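As a quick sanity check of the environment (a suggestion, not part of the official setup), you can run a short script inside the Poetry environment, for example with `poetry run python check_env.py`; the file name is only an example.

```python
# check_env.py -- verify that PyTorch and a GPU are visible inside the Poetry environment.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```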

Released Models

We initially release pre-trained models with a 500K entity vocabulary based on the `roberta.base` and `roberta.large` models.

| Name              | Base Model    | Entity Vocab Size | Params | Download |
| ----------------- | ------------- | ----------------- | ------ | -------- |
| LUKE-500K (base)  | roberta.base  | 500K              | 253 M  | Link     |
| LUKE-500K (large) | roberta.large | 500K              | 483 M  | Link     |

Reproducing Experimental Results

The experiments were conducted using Python 3.6 and PyTorch 1.2.0 on a server with one or eight NVIDIA V100 GPUs. We used NVIDIA's PyTorch Docker container 19.02. For computational efficiency, we used mixed-precision training based on the APEX library, which can be installed as follows:

$ git clone https://github.com/NVIDIA/apex.git
$ cd apex
$ git checkout c3fad1ad120b23055f6630da0b029c8b626db78f
$ pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .

The APEX library is not needed if you do not use the `--fp16` option or if you reproduce the results from the trained checkpoint files.
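For reference, the sketch below illustrates how APEX amp mixed precision is typically enabled in a PyTorch training step, which is what the `--fp16` option relies on; it is a generic example with a toy model, not this repository's actual training loop.

```python
# Generic APEX amp (O1) mixed-precision sketch; the linear model and random data
# are placeholders, not the LUKE fine-tuning code.
import torch
from apex import amp

model = torch.nn.Linear(128, 2).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Patch the model and optimizer for mixed-precision execution.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

inputs = torch.randn(4, 128).cuda()
labels = torch.randint(0, 2, (4,)).cuda()
loss = torch.nn.functional.cross_entropy(model(inputs), labels)

# Scale the loss so fp16 gradients do not underflow, then step as usual.
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()
```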

The commands to reproduce the experimental results are provided below:

Entity Typing on Open Entity Dataset

Dataset: Link
Checkpoint file (compressed): Link

Using the checkpoint file:

$ python -m examples.cli \
    --model-file=luke_large_500k.tar.gz \
    --output-dir= \
    entity-typing run \
    --data-dir= \
    --checkpoint-file= \
    --no-train

Fine-tuning the model:

$ python -m examples.cli \
    --model-file=luke_large_500k.tar.gz \
    --output-dir= \
    entity-typing run \
    --data-dir= \
    --train-batch-size=2 \
    --gradient-accumulation-steps=2 \
    --learning-rate=1e-5 \
    --num-train-epochs=3 \
    --fp16

Relation Classification on TACRED Dataset

Dataset: Link
Checkpoint file (compressed): Link

Using the checkpoint file:

$ python -m examples.cli \
    --model-file=luke_large_500k.tar.gz \
    --output-dir= \
    relation-classification run \
    --data-dir= \
    --checkpoint-file= \
    --no-train

Fine-tuning the model:

$ python -m examples.cli \
    --model-file=luke_large_500k.tar.gz \
    --output-dir= \
    relation-classification run \
    --data-dir= \
    --train-batch-size=4 \
    --gradient-accumulation-steps=8 \
    --learning-rate=1e-5 \
    --num-train-epochs=5 \
    --fp16

Named Entity Recognition on CoNLL-2003 Dataset

Dataset: Link
Checkpoint file (compressed): Link

Using the checkpoint file:

$ python -m examples.cli \
    --model-file=luke_large_500k.tar.gz \
    --output-dir= \
    ner run \
    --data-dir= \
    --checkpoint-file= \
    --no-train

Fine-tuning the model:

$ python -m examples.cli \
    --model-file=luke_large_500k.tar.gz \
    --output-dir= \
    ner run \
    --data-dir= \
    --train-batch-size=2 \
    --gradient-accumulation-steps=2 \
    --learning-rate=1e-5 \
    --num-train-epochs=5 \
    --fp16

Cloze-style Question Answering on ReCoRD Dataset

Dataset: Link
Checkpoint file (compressed): Link

Using the checkpoint file:

$ python -m examples.cli \
    --model-file=luke_large_500k.tar.gz \
    --output-dir= \
    entity-span-qa run \
    --data-dir= \
    --checkpoint-file= \
    --no-train

Fine-tuning the model:

$ python -m examples.cli \
    --num-gpus=8 \
    --model-file=luke_large_500k.tar.gz \
    --output-dir= \
    entity-span-qa run \
    --data-dir= \
    --train-batch-size=1 \
    --gradient-accumulation-steps=4 \
    --learning-rate=1e-5 \
    --num-train-epochs=2 \
    --fp16

Extractive Question Answering on SQuAD 1.1 Dataset

Dataset: Link
Checkpoint file (compressed): Link
Wikipedia data files (compressed): Link

Using the checkpoint file:

$ python -m examples.cli \
    --model-file=luke_large_500k.tar.gz \
    --output-dir= \
    reading-comprehension run \
    --data-dir= \
    --checkpoint-file= \
    --no-negative \
    --wiki-link-db-file=enwiki_20160305.pkl \
    --model-redirects-file=enwiki_20181220_redirects.pkl \
    --link-redirects-file=enwiki_20160305_redirects.pkl \
    --no-train

Fine-tuning the model:

$ python -m examples.cli \
    --num-gpus=8 \
    --model-file=luke_large_500k.tar.gz \
    --output-dir= \
    reading-comprehension run \
    --data-dir= \
    --no-negative \
    --wiki-link-db-file=enwiki_20160305.pkl \
    --model-redirects-file=enwiki_20181220_redirects.pkl \
    --link-redirects-file=enwiki_20160305_redirects.pkl \
    --train-batch-size=2 \
    --gradient-accumulation-steps=3 \
    --learning-rate=15e-6 \
    --num-train-epochs=2 \
    --fp16

Citation

If you use LUKE in your work, please cite the original paper:

@inproceedings{yamada2020luke,
  title={LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention},
  author={Ikuya Yamada and Akari Asai and Hiroyuki Shindo and Hideaki Takeda and Yuji Matsumoto},
  booktitle={EMNLP},
  year={2020}
}

Contact Info

Please submit a GitHub issue or send an e-mail to Ikuya Yamada ([email protected]) for help or issues using LUKE.
