UnsupervisedQA

Code, Data and models supporting the experiments in the ACL 2019 Paper: Unsupervised Question Answering by Cloze Translation.

Obtaining training data for Question Answering (QA) is time-consuming and resource-intensive, and existing QA datasets are only available for limited domains and languages. In this work, we take some of the first steps towards unsupervised QA, and develop an approach that, without using the SQuAD training data at all, achieves 56.4 F1 on SQuAD v1.1, and 64.5 F1 when the answer is a named entity mention.


This repository provides code to run pre-trained models that generate synthetic question answering data. We also make a very large synthetic training dataset for extractive question answering available.

Dataset Downloads

We make available a dataset of 4 million SQuAD-like question answering datapoints, automatically generated by the unsupervised system described in the paper.

The data can be downloaded here. The data is in the SQuAD v1 format, and contains:

| Fold | # Paragraphs | # QA pairs |
| :-----------------: | :-----------: | :-----------: |
| unsupervised_qa_train.json | 782,556 | 3,915,498 |
| unsupervised_qa_dev.json | 1,000 | 4,795 |
| unsupervised_qa_test.json | 1,000 | 4,804 |

Using this training data to fine-tune BERT-Large for reading comprehension will achieve over 50.0 F1 on the SQuAD V1.1 development set using an appropriate early stopping strategy on the unsupervised_qa dev set.
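
As a quick sanity check after downloading, the paragraph and QA pair counts in the table can be recomputed from a file. The sketch below assumes only the standard SQuAD v1 layout (`data` -> `paragraphs` -> `qas`); the function names and the tiny in-memory example are illustrative and not part of this repository.

```python
import json


def load_squad_v1(path):
    """Load a SQuAD v1-style JSON file into a dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def count_squad_v1(dataset):
    """Return (num_paragraphs, num_qa_pairs) for a SQuAD v1-style dict."""
    num_paragraphs = 0
    num_qa_pairs = 0
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            num_paragraphs += 1
            num_qa_pairs += len(paragraph["qas"])
    return num_paragraphs, num_qa_pairs


# Tiny hand-written example in the SQuAD v1 layout (not real data).
context = "The Eiffel Tower is in Paris."
example = {
    "version": "1.1",
    "data": [
        {
            "title": "example",
            "paragraphs": [
                {
                    "context": context,
                    "qas": [
                        {
                            "id": "q1",
                            "question": "Where is the Eiffel Tower?",
                            "answers": [
                                {"text": "Paris", "answer_start": context.index("Paris")}
                            ],
                        },
                        {
                            "id": "q2",
                            "question": "What is in Paris?",
                            "answers": [
                                {"text": "The Eiffel Tower", "answer_start": 0}
                            ],
                        },
                    ],
                }
            ],
        }
    ],
}

print(count_squad_v1(example))
```

To check a downloaded fold, pass the result of `load_squad_v1("unsupervised_qa_dev.json")` to `count_squad_v1` and compare against the table.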

Models and Code

In addition to the above data, this repository provides functionality to generate synthetic training data from user-provided documents.

Installation:

The code is built to run on top of UnsupervisedMT, and requires all of its dependencies. Additional requirements are spaCy (for NER and noun chunking), attrs, and NLTK and allennlp (for constituency parsing). It was developed to run on Ubuntu Linux 18.04 and Python 3.7, with CUDA 9.

(Optionally) Create a conda environment to keep things clean:

conda create -n uqa37 python=3.7 && conda activate uqa37

The recommended way to install is shown below, which should install and handle all dependencies:

```
# clone the repo
git clone https://github.com/facebookresearch/UnsupervisedQA.git
cd UnsupervisedQA

# install python dependencies:
pip install -r requirements.txt

# install UnsupervisedMT and its dependencies
./install_tools.sh
```

Models:

Four UNMT models are made available for download:

  • Sentence Cloze boundaries, Noun Phrase Answers
  • Sentence Cloze boundaries, Named Entity Answers
  • Sub-clause Cloze boundaries, Named Entity Answers
  • Sub-clause Cloze boundaries, Named Entity Answers, Wh Heuristics (best downstream performance)

The models can be downloaded using the script:

./download_models.sh

This will download all the models and unzip them to the appropriate directory. Each unzipped model is about 850MB, so the total space requirement is about 3.5GB.

Usage:

You can generate reading comprehension training data using `unsupervisedqa.generate_synthetic_qa_data`.

This script will allow you to generate unsupervised question answering data using the `identity`, `noisy cloze` or `unsupervised NMT` methods explored in the paper, as well as specifying several different configurations (i.e. whether to use subclause shortening, whether to use named entity answers and whether to use the wh heuristic).

This script provides the following command line arguments:

usage: generate_synthetic_qa_data.py [-h] [--input_file_format {txt,jsonl}]
                                     [--output_file_formats OUTPUT_FILE_FORMATS]
                                     [--translation_method {identity,noisy_cloze,unmt}]
                                     [--use_subclause_clozes]
                                     [--use_named_entity_clozes]
                                     [--use_wh_heuristic]
                                     input_file output_file

Generate synthetic training data for extractive QA tasks without supervision

positional arguments:
  input_file            input file, see readme for formatting info
  output_file           Path to write generated data to, see readme for
                        formatting info

optional arguments:
  -h, --help            show this help message and exit
  --input_file_format {txt,jsonl}
                        input file format, see readme for more info, default
                        is txt
  --output_file_formats OUTPUT_FILE_FORMATS
                        comma-separated list of output file formats, from
                        [jsonl, squad], an output file will be created for
                        each format. Default is 'jsonl,squad'
  --translation_method {identity,noisy_cloze,unmt}
                        define the method to generate clozes -- either the
                        Unsupervised NMT method (unmt), or the identity or
                        noisy cloze baseline methods. UNMT is recommended for
                        downstream performance, but the noisy_cloze is
                        relatively strong on downstream QA and fast to
                        generate. Default is unmt
  --use_subclause_clozes
                        pass this flag to shorten clozes with constituency
                        parsing instead of using sentence boundaries
                        (recommended for downstream performance)
  --use_named_entity_clozes
                        pass this flag to use named entity answer prior
                        instead of noun phrases (recommended for downstream
                        performance)
  --use_wh_heuristic    pass this flag to use the wh-word heuristic
                        (recommended for downstream performance). Only
                        compatible with named entity clozes

The input format is specified by the `--input_file_format` argument, and can either be a `.txt` file of paragraphs, one per line, for questions and answers to be generated from, or a `.jsonl` file with each line containing a json-serialised dict of the format `{"text": text of paragraph, "paragraph_id" : your unique identifier for the paragraph}`.
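
A `.jsonl` input file in this format could be produced along the following lines. This is a minimal sketch: the `write_jsonl_input` helper and the `para-<n>` identifier scheme are hypothetical, and any unique identifier should work as `paragraph_id`.

```python
import json


def write_jsonl_input(paragraphs, path):
    """Write one JSON object per line with "text" and "paragraph_id" keys."""
    with open(path, "w", encoding="utf-8") as f:
        for i, text in enumerate(paragraphs):
            # "para-<n>" is an illustrative id scheme, not one the repository requires.
            record = {"text": text, "paragraph_id": f"para-{i}"}
            # json.dumps escapes embedded newlines, so each record stays on one line.
            f.write(json.dumps(record) + "\n")


if __name__ == "__main__":
    write_jsonl_input(
        ["The Eiffel Tower is in Paris.", "Mount Everest is in the Himalayas."],
        "example_input.jsonl",
    )
```

The resulting file would then be passed as `input_file` together with `--input_file_format jsonl`.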

The output format can be specified by the user using the `--output_file_formats` argument. The user can choose between `jsonl` and `squad` format. Requesting the `squad` format will output a file using the squad v1.1 format, ready to be plugged into downstream extractive QA tasks. The `jsonl` format provides more metadata than the squad format; the fields are explained below:
{
    "cloze_id": unique identifier for this datapoint
    "paragraph": data on the paragraph this datapoint was generated from
    "source_text": the text from the paragraph the cloze was generated from
    "source_start": character index in paragraph where "source_text" starts
    "cloze_text": the text of the cloze question the question is generated from
    "answer_text": the answer text of the (cloze) question
    "answer_start": the character index that the answer starts at in the paragraph
    "constituency_parse": the constituency parse of the "source_text" if available, otherwise null,
    "root_label": the node label of the root of the constituency parse if available, otherwise null,
    "answer_type": The named entity label of the answer (if using named entity clozes) otherwise "NOUNPHRASE"
    "question_text": the text of the natural question, translated from "cloze_text"
}
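
One way to inspect the `jsonl` output is to check that each answer span lines up with its paragraph. The sketch below is hedged: it assumes the `"paragraph"` field is either the paragraph string itself or a dict holding the paragraph text under a `"text"` key, which the field list above does not spell out. The helper names and the example record are hypothetical.

```python
import json


def iter_jsonl(path):
    """Yield one parsed record per non-empty line of a jsonl file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def answer_is_aligned(record):
    """Check that answer_text occurs in the paragraph at answer_start."""
    paragraph = record["paragraph"]
    # Assumption: the paragraph text sits under "text" when "paragraph" is a dict.
    paragraph_text = paragraph["text"] if isinstance(paragraph, dict) else paragraph
    start = record["answer_start"]
    end = start + len(record["answer_text"])
    return paragraph_text[start:end] == record["answer_text"]


# Hand-written record using a subset of the fields listed above (not real output).
text = "The Eiffel Tower is in Paris."
record = {
    "cloze_id": "example-0",
    "paragraph": {"paragraph_id": "para-0", "text": text},
    "answer_text": "Paris",
    "answer_start": text.index("Paris"),
}

print(answer_is_aligned(record))
```

For a generated file, `all(answer_is_aligned(r) for r in iter_jsonl(path))` would run the same check over every datapoint, provided the assumption about `"paragraph"` holds.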

A working example to produce unsupervised NMT-translated questions using the model trained with wh heuristics, named entity answers and subclause shortening is below:

python -m unsupervisedqa.generate_synthetic_qa_data example_input.txt example_output \
    --input_file_format "txt" \
    --output_file_formats "jsonl,squad" \
    --translation_method unmt \
    --use_named_entity_clozes \
    --use_subclause_clozes \
    --use_wh_heuristic 
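
When generation is driven from Python rather than the shell, the same invocation can be assembled as an argument list. This is only a sketch: `build_generation_command` is a hypothetical helper, not part of the repository, and it uses just the flags shown in the usage text above, including the documented restriction that the wh heuristic needs named entity clozes.

```python
import sys


def build_generation_command(input_file, output_file,
                             input_file_format="txt",
                             output_file_formats=("jsonl", "squad"),
                             translation_method="unmt",
                             use_subclause_clozes=False,
                             use_named_entity_clozes=False,
                             use_wh_heuristic=False):
    """Build the argv list for unsupervisedqa.generate_synthetic_qa_data."""
    if use_wh_heuristic and not use_named_entity_clozes:
        # The help text states the wh heuristic only works with named entity clozes.
        raise ValueError("--use_wh_heuristic requires --use_named_entity_clozes")
    cmd = [
        sys.executable, "-m", "unsupervisedqa.generate_synthetic_qa_data",
        input_file, output_file,
        "--input_file_format", input_file_format,
        "--output_file_formats", ",".join(output_file_formats),
        "--translation_method", translation_method,
    ]
    if use_subclause_clozes:
        cmd.append("--use_subclause_clozes")
    if use_named_entity_clozes:
        cmd.append("--use_named_entity_clozes")
    if use_wh_heuristic:
        cmd.append("--use_wh_heuristic")
    return cmd


cmd = build_generation_command(
    "example_input.txt", "example_output",
    use_subclause_clozes=True,
    use_named_entity_clozes=True,
    use_wh_heuristic=True,
)
print(" ".join(cmd))
# To actually run it from the repository root:
#   import subprocess; subprocess.run(cmd, check=True)
```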

I'm running out of GPU memory

The repository requires a CUDA-enabled GPU (this is a requirement of UnsupervisedMT), but you can reduce the amount of GPU memory required by adjusting the batch sizes. This can be done by modifying the `unsupervisedqa/configs.py` file, adjusting `CONSTITUENCY_BATCH_SIZE` and `UNMT_BATCH_SIZE`.

Training Your own question translation models

This repository only provides functionality to run the pre-trained unsupervised question translation models from the paper. Users who want to train new question translation models should use the training functionality in UnsupervisedMT, or consider the newer and more powerful XLM repository.

To train question translation models in UnsupervisedMT, first prepare a large corpus of cloze questions (potentially using the functionality in this repository) and a large corpus of natural questions. Preprocess these corpora by adapting UnsupervisedMT/NMT/get_data_enfr.sh, and train using the example script in UnsupervisedMT/README, with appropriate edits to the args (e.g. en -> cloze and fr -> question) and paths.

References

Please cite [1] and [2] if you found the resources in this repository useful.

Unsupervised Question Answering by Cloze Translation

[1] P. Lewis, L. Denoyer, S. Riedel, Unsupervised Question Answering by Cloze Translation

@inproceedings{lewis2019unsupervisedqa,
  title={Unsupervised Question Answering by Cloze Translation},
  author={Lewis, Patrick and Denoyer, Ludovic and Riedel, Sebastian},
  booktitle={Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  year={2019}
}

Phrase-Based & Neural Unsupervised Machine Translation

[2] G. Lample, M. Ott, A. Conneau, L. Denoyer, MA. Ranzato, Phrase-Based & Neural Unsupervised Machine Translation

@inproceedings{lample2018phrase,
  title={Phrase-Based \& Neural Unsupervised Machine Translation},
  author={Lample, Guillaume and Ott, Myle and Conneau, Alexis and Denoyer, Ludovic and Ranzato, Marc'Aurelio},
  booktitle={Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year={2018}
}

License

See the LICENSE file for more details.

Troubleshooting

If you run into problems installing dependencies (particularly allennlp), installing libffi may help:

apt-get install libffi6 libffi-dev
