A deep NLP library, based on Keras / TensorFlow, focused on question answering (but useful for other NLP too)
DeepQA is built on top of Keras. We've decided that PyTorch is a better platform for NLP research, so we re-wrote DeepQA into a PyTorch library called AllenNLP. There will be no more development of DeepQA, but we're pretty excited about AllenNLP - if you're doing deep learning for natural language processing, you should check it out!
DeepQA is a library for doing high-level NLP tasks with deep learning, particularly focused on various kinds of question answering. DeepQA is built on top of Keras and TensorFlow, and can be thought of as an interface to these systems that makes NLP easier.
Specifically, this library provides a number of benefits over plain Keras / TensorFlow; see the documentation for details.
DeepQA is built using Python 3. The easiest way to set up a compatible environment is to use Conda. This will set up a virtual environment with the exact version of Python used for development along with all the dependencies needed to run DeepQA.
1. Create a Conda environment with Python 3.

   ```
   conda create -n deep_qa python=3.5
   ```

2. Now activate the Conda environment.

   ```
   source activate deep_qa
   ```

3. Install the required dependencies.

   ```
   ./scripts/install_requirements.sh
   ```

4. Set the `PYTHONHASHSEED` for repeatable experiments.

   ```
   export PYTHONHASHSEED=2157
   ```

You should now be able to test your installation with `pytest -v`. Congratulations! You now have a development environment for deep_qa that uses TensorFlow with CPU support. (For GPU support, see requirements.txt for information on how to install `tensorflow-gpu`.)
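For an extra sanity check beyond the test suite, you can confirm from Python that TensorFlow imports and that the hash seed is visible; this snippet is just an illustrative check, not part of DeepQA:

```
import os
import tensorflow as tf

# Should print the TensorFlow version pinned in requirements.txt.
print(tf.__version__)
# Should print "2157" if you exported PYTHONHASHSEED as above.
print(os.environ.get("PYTHONHASHSEED"))
```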
To train or evaluate a model using a clone of the DeepQA repository, the recommended entry point is the `run_model.py` script. The first argument to that script is a parameter file, described more below. The second argument determines the behavior, either training a model or evaluating a trained model against a test dataset. Current valid options for the second argument are `train` and `test` (omitting the argument is the same as passing `train`).
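For example, assuming the script lives in the `scripts/` directory of your clone (alongside `install_requirements.sh`) and that `examples/my_experiment.json` stands in for one of the parameter files from the examples directory, a training run followed by evaluation would look like:

```
python scripts/run_model.py examples/my_experiment.json train
python scripts/run_model.py examples/my_experiment.json test
```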
Parameter files specify the model class you're using, model hyperparameters, training details, data files, data generator details, and many other things. You can see example parameter files in the examples directory. You can get some notion of what parameters are available by looking through the documentation.
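To give a rough sense of the shape of these files, here is a hypothetical specification, written as a Python dict so it can carry inline comments (the real file is JSON). Only `model_class` is referenced elsewhere in this README; every other key name below is an illustrative placeholder, so check the examples directory and the documentation for the real parameter names.

```
# Hypothetical parameter specification -- all key names except "model_class"
# are illustrative placeholders, not DeepQA's actual parameter names.
example_params = {
    "model_class": "MyModelClass",                      # which model implementation to train
    "train_files": ["/path/to/train/file"],             # illustrative: training data
    "validation_files": ["/path/to/validation/file"],   # illustrative: validation data
    "num_epochs": 20,                                   # illustrative: a training hyperparameter
}
```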
Actually training a model will require input files, which you need to provide. We have a companion library, DeepQA Experiments, which was originally designed to produce input files and run experiments, and can be used to generate required data files for most of the tasks we have models for. We're moving towards putting the data processing code directly into DeepQA, so that DeepQA Experiments is not necessary, but for now, getting training data files in the right format is most easily done with DeepQA Experiments.
If you are using DeepQA as a library in your own code, it is still straightforward to run your model. Instead of using the `run_model.py` script to do the training/evaluation, you can do it yourself as follows:
```
from deep_qa import run_model, evaluate_model, load_model, score_dataset

# Train a model given a json specification.
run_model("/path/to/json/parameter/file")

# Load a model given a json specification.
loaded_model = load_model("/path/to/json/parameter/file")
# Do some more exciting things with your model here!

# Get predictions from a pre-trained model on some test data specified in the json parameters.
predictions = score_dataset("/path/to/json/parameter/file")
# Compute your own metrics, or do beam search, or whatever you want with the predictions here.

# Compute Keras' metrics on a test dataset, using a pre-trained model.
evaluate_model("/path/to/json/parameter/file", ["/path/to/data/file"])
```
The rest of the usage guidelines, examples, etc., are the same as when working in a clone of the repository.
To implement a new model in DeepQA, you need to subclass `TextTrainer`. There is documentation on what is necessary for this; see in particular the Abstract methods section. For a simple example of a fully functional model, see the simple sequence tagger, which has about 20 lines of actual implementation code.
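As a rough sketch of what such a subclass looks like (the import path and the overridden method below are assumptions based on typical usage, not the authoritative interface; the Abstract methods section of the documentation lists the overrides you actually need):

```
from deep_qa.training import TextTrainer  # import path is an assumption; check the docs

class MyGreatModel(TextTrainer):
    """Hypothetical model skeleton; see the simple sequence tagger for a real example."""

    def _build_model(self):
        # Assumed override: build and return the Keras model for this task, using
        # the text-encoding utilities that TextTrainer provides.
        raise NotImplementedError
```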
In order to train, load and evaluate models which you have written yourself, simply pass an additional argument to the functions above and remove the `model_class` parameter from your json specification. For example:

```
from deep_qa import run_model
from .local_project import MyGreatModel

run_model("/path/to/json/parameter/file", model_class=MyGreatModel)
```
If you're doing a new task, or a new variant of a task with a different input/output specification, you probably also need to implement an `Instance` type. The `Instance` handles reading data from a file and converting it into numpy arrays that can be used for training and evaluation. This only needs to happen once for each input/output spec.
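To make that concrete, here is a hypothetical sketch of the idea (the base class, its import path, and the `read_from_line` hook are assumptions; consult the data-processing documentation for the real interface):

```
from deep_qa.data.instances import TextInstance  # assumed import path

class TaggingInstance(TextInstance):
    """Hypothetical instance for a tagging task, one 'sentence<TAB>tags' pair per line."""

    @classmethod
    def read_from_line(cls, line):
        # Assumed hook: parse one line of a data file into an instance; conversion
        # to padded numpy arrays happens later, in a separate indexing step.
        sentence, tags = line.strip().split("\t")
        return cls(sentence, tags.split())
```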
DeepQA has implementations of state-of-the-art methods for a variety of tasks, and the code allows for easy experimentation with several standard datasets; see the documentation for the current list of implemented models and supported datasets. Note that the data processing code for most of these datasets currently lives in DeepQA Experiments, however.
If you use this code and think something could be improved, pull requests are very welcome. Opening an issue is ok, too, but we can respond much more quickly to pull requests.
This code is released under the terms of the Apache 2 license.