

[CVPR 2017] Torch code for Visual Dialog


Code for the paper

Visual Dialog
Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M. F. Moura, Devi Parikh, Dhruv Batra
CVPR 2017 (Spotlight)

Visual Dialog requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content. Given an image, dialog history, and a follow-up question about the image, the AI agent has to answer the question.


This repository contains code for training, evaluating and visualizing results for all combinations of encoder-decoder architectures described in the paper. Specifically, we have 3 encoders: Late Fusion (LF), Hierarchical Recurrent Encoder (HRE), Memory Network (MN), and 2 kinds of decoding: Generative (G) and Discriminative (D).


If you find this code useful, consider citing our work:

@inproceedings{visdial,
  title={{V}isual {D}ialog},
  author={Abhishek Das and Satwik Kottur and Khushi Gupta and Avi Singh
    and Deshraj Yadav and Jos\'e M.F. Moura and Devi Parikh and Dhruv Batra},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  year={2017}
}


All our code is implemented in Torch (Lua). Installation instructions are as follows:

git clone ~/torch --recursive
cd ~/torch; bash install-deps;

Additionally, our code uses the following packages: torch/torch7, torch/nn, torch/nngraph, Element-Research/rnn, torch/image, lua-cjson, loadcaffe, torch-hdf5. After Torch is installed, these can be installed/updated using:

luarocks install torch
luarocks install nn
luarocks install nngraph
luarocks install image
luarocks install lua-cjson
luarocks install loadcaffe
luarocks install luabitop
luarocks install totem


NOTE: luarocks install rnn defaults to torch/rnn. To install Element-Research/rnn instead, follow these steps:
git clone
cd rnn
luarocks make rocks/rnn-scm-1.rockspec

Installation instructions for torch-hdf5 are given here.

NOTE: torch-hdf5 does not work with some versions of gcc. We recommend gcc 4.8 or gcc 4.9 with Lua 5.1 for a proper installation of torch-hdf5.

Running on GPUs

Although our code should work on CPUs, it is highly recommended to use GPU acceleration with CUDA. You'll also need torch/cutorch, torch/cudnn and torch/cunn.

luarocks install cutorch
luarocks install cunn
luarocks install cudnn

Training your own network

Preprocessing VisDial

The preprocessing script is in Python, and you'll need to install NLTK.

pip install nltk
pip install numpy
pip install h5py
python -c "import nltk; nltk.download('all')"

The VisDial v1.0 dataset can be downloaded and preprocessed as specified below. The path provided as -image_root must have four subdirectories: the ones laid out as per the COCO dataset, and the ones which can be downloaded from here.
cd data
python -download -image_root /path/to/images
cd ..

To download and preprocess the VisDial v0.9 dataset, pass the extra argument -version 0.9 when running the script.

This script will generate two files: one containing tokenized captions, questions, answers and image indices, and one containing vocabulary mappings and COCO image ids.
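
As a rough illustration of what "tokenized" text and "vocabulary mappings" mean here, the sketch below builds a word-to-index map and encodes a question with it. It is not the repository's preprocessing script: the whitespace tokenizer stands in for NLTK, and tokenize, build_vocab, encode, the UNK token and the min_count threshold are hypothetical names chosen for this example.

```python
from collections import Counter

UNK = "UNK"  # hypothetical out-of-vocabulary token


def tokenize(text):
    # Stand-in for an NLTK tokenizer: lowercase, split off "?", split on spaces.
    return text.lower().replace("?", " ?").split()


def build_vocab(sentences, min_count=1):
    # Count words and keep those seen at least min_count times.
    counts = Counter(w for s in sentences for w in tokenize(s))
    words = sorted(w for w, c in counts.items() if c >= min_count)
    # Indices start at 1 (the Lua convention); UNK takes the last index.
    word2ind = {w: i + 1 for i, w in enumerate(words)}
    word2ind[UNK] = len(word2ind) + 1
    return word2ind


def encode(text, word2ind):
    # Map each token to its index, falling back to UNK for unseen words.
    return [word2ind.get(w, word2ind[UNK]) for w in tokenize(text)]


questions = ["Is the cat black?", "Is it sleeping?"]
word2ind = build_vocab(questions)
ind2word = {i: w for w, i in word2ind.items()}

encoded = encode("Is the dog black?", word2ind)
decoded = [ind2word[i] for i in encoded]
print(encoded)  # [4, 7, 8, 2, 1]
print(decoded)  # ['is', 'the', 'UNK', 'black', '?']
```

The word "dog" never appears in the toy training questions, so it maps to the UNK index. Storing both directions of the mapping lets generated index sequences be turned back into text.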

Extracting image features

Since we don't finetune the CNN, training is significantly faster if image features are pre-extracted. Currently this repository provides support for extraction from VGG-16 and ResNets. We use image features from VGG-16. The VGG-16 model can be downloaded and features extracted using:

sh scripts/ vgg 16  # works for 19 as well
cd data
# For all models except mn-att-ques-im-hist
th prepro_img_vgg16.lua -imageRoot /path/to/images -gpuid 0
# For mn-att-ques-im-hist
th prepro_img_vgg16.lua -imageRoot /path/to/images -imgSize 448 -layerName pool5 -gpuid 0

Similarly, ResNet models released by Facebook can be used for feature extraction. Feature extraction can be carried out in a similar manner as VGG-16:

sh scripts/ resnet 200  # works for 18, 34, 50, 101, 152 as well
cd data
th prepro_img_resnet.lua -imageRoot /path/to/images -cnnModel /path/to/t7/model -gpuid 0

Running either of these should generate a file containing image features for the splits corresponding to VisDial v1.0.

Training


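Finally, we can get to training models! All supported encoders are in the encoders folder (lf-ques, lf-ques-hist, lf-ques-im, lf-ques-im-hist, lf-att-ques-im-hist, hre-ques-hist, hre-ques-im-hist, hrea-ques-im-hist, mn-ques-hist, mn-ques-im-hist, mn-att-ques-im-hist), and decoders in the decoders folder (gen and disc).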

Generative (gen) decoding tries to maximize the likelihood of the ground-truth response and only has access to single input-output pairs of dialog, while discriminative (disc) decoding makes use of the 100 candidate option responses provided for every round of dialog, and maximizes the likelihood of the correct option.
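
The difference between the two objectives can be sketched numerically. This is a minimal pure-Python illustration, not the repository's Lua code: generative_nll and discriminative_nll are hypothetical names, the softmax cross-entropy form of the discriminative loss is an assumption, and the probabilities and scores are made-up inputs.

```python
import math


def generative_nll(token_probs):
    # Negative log-likelihood of the ground-truth answer, given the
    # probability the decoder assigned to each of its tokens in turn.
    return -sum(math.log(p) for p in token_probs)


def discriminative_nll(option_scores, correct_index):
    # Softmax over the candidate options, then the negative log-probability
    # of the correct one (cross-entropy against a one-hot target).
    m = max(option_scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in option_scores))
    return log_z - option_scores[correct_index]


# Generative: a 3-token answer, scored token by token.
gen_loss = generative_nll([0.5, 0.25, 0.5])

# Discriminative: 100 candidate options with equal scores, so the
# correct option receives probability 1/100.
disc_loss = discriminative_nll([0.0] * 100, correct_index=7)

print(round(gen_loss, 4))   # 2.7726, i.e. log(16)
print(round(disc_loss, 4))  # 4.6052, i.e. log(100)
```

The generative loss never looks at the other candidates, whereas the discriminative loss only decreases when the correct option is scored higher relative to the other 99.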

Encoders and decoders can be arbitrarily plugged together. For example, to train an HRE model with question and history information only (no images), and generative decoding:

th train.lua -encoder hre-ques-hist -decoder gen -gpuid 0

Similarly, to train a Memory Network model with question, image and history information, and discriminative decoding:

th train.lua -encoder mn-ques-im-hist -decoder disc -gpuid 0
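
Because any encoder can be paired with any decoder, the full set of training commands is a simple cross product. The sketch below only builds command strings from the encoder and decoder names that appear in this README; train_command is a hypothetical helper, and it assumes the -encoder, -decoder and -gpuid flags behave as in the two examples above.

```python
from itertools import product

# Encoder and decoder names as they appear in this README.
ENCODERS = [
    "lf-ques", "lf-ques-hist", "lf-ques-im", "lf-ques-im-hist",
    "lf-att-ques-im-hist",
    "hre-ques-hist", "hre-ques-im-hist", "hrea-ques-im-hist",
    "mn-ques-hist", "mn-ques-im-hist", "mn-att-ques-im-hist",
]
DECODERS = ["gen", "disc"]


def train_command(encoder, decoder, gpuid=0):
    # Mirror the command-line format of the examples above.
    return "th train.lua -encoder {} -decoder {} -gpuid {}".format(
        encoder, decoder, gpuid)


commands = [train_command(e, d) for e, d in product(ENCODERS, DECODERS)]
print(len(commands))  # 22
print(commands[0])    # th train.lua -encoder lf-ques -decoder gen -gpuid 0
```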

Note: For attention-based encoders, set both the spatial-size and feature-size command line params; feature dimensions are interpreted as (batch X spatial X spatial X feature). For other encoders, the spatial-size param is redundant.
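
To make the (batch X spatial X spatial X feature) layout concrete, here is a small sketch of the shape arithmetic for VGG-16 pool5 features. It assumes the standard VGG-16 architecture (five 2x2 max-pooling stages, 512 channels in the last convolutional block); pool5_shape and flatten_spatial are hypothetical helpers, not part of this repository.

```python
VGG16_POOL_STAGES = 5       # each 2x2 max-pool halves height and width
VGG16_POOL5_CHANNELS = 512  # channels in the last convolutional block


def pool5_shape(batch, img_size):
    # Spatial side length after five halvings of a square input image.
    spatial = img_size // (2 ** VGG16_POOL_STAGES)
    return (batch, spatial, spatial, VGG16_POOL5_CHANNELS)


def flatten_spatial(shape):
    # Attention scores each grid cell, so the two spatial axes are
    # commonly merged into a single "regions" axis.
    batch, height, width, feat = shape
    return (batch, height * width, feat)


shape = pool5_shape(batch=32, img_size=448)
print(shape)                   # (32, 14, 14, 512)
print(flatten_spatial(shape))  # (32, 196, 512)
```

Under these assumptions, features extracted with -imgSize 448 and -layerName pool5 would have a spatial size of 14 and a feature size of 512.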

The training script saves model snapshots at regular intervals in the checkpoints folder.


It takes about 15-20 epochs to train models with generative decoding to convergence, and 4-8 epochs for discriminative decoding.

Evaluation


We evaluate model performance by where the model ranks the human response among the 100 response options given for every round of dialog, using retrieval metrics: mean reciprocal rank (MRR), R@1, R@5, R@10 and mean rank (MR).
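
These retrieval metrics are all simple functions of the rank the model assigns to the human response. The sketch below shows one way to compute them; it is an illustration rather than the code in evaluate.lua, and rank_of, retrieval_metrics and the example ranks are made up for this example.

```python
def rank_of(scores, gt_index):
    # 1-based rank of the ground-truth option: one plus the number of
    # options that the model scored strictly higher.
    return 1 + sum(1 for s in scores if s > scores[gt_index])


def retrieval_metrics(gt_ranks, ks=(1, 5, 10)):
    # gt_ranks holds one 1-based rank per dialog round.
    n = float(len(gt_ranks))
    metrics = {
        "mrr": sum(1.0 / r for r in gt_ranks) / n,   # mean reciprocal rank
        "mean_rank": sum(gt_ranks) / n,
    }
    for k in ks:
        # R@k: fraction of rounds where the human response is in the top k.
        metrics["r@{}".format(k)] = sum(1 for r in gt_ranks if r <= k) / n
    return metrics


print(rank_of([0.1, 0.9, 0.3], gt_index=2))  # 2

metrics = retrieval_metrics([1, 2, 4, 20])
print(metrics["mrr"])        # approximately 0.45
print(metrics["mean_rank"])  # 6.75
print(metrics["r@1"], metrics["r@5"], metrics["r@10"])  # 0.25 0.75 0.75
```

Higher is better for MRR and R@k, while lower is better for mean rank.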

Model evaluation can be run using:

th evaluate.lua -loadPath checkpoints/model.t7 -gpuid 0

Note that evaluation requires image features, tokenized dialogs and vocabulary mappings.

Running Beam Search & Visualizing Results

We also include code for running beam search on your model snapshots. This gives significantly nicer results than argmax decoding, and can be run as follows:

th generate.lua -loadPath checkpoints/model.t7 -maxThreads 50

This would compute predictions for 50 threads from the dataset split in use and save the results for visualization. To browse them, start a local web server from the vis folder:
cd vis
# python 3.6
python -m http.server
# python 2.7
# python -m SimpleHTTPServer

Now visit localhost:8000 (the default port for both server commands) in your browser to see generated results.

Sample results from HRE-QIH-G available here.
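
For readers unfamiliar with beam search, the toy sketch below shows why it can beat argmax (greedy) decoding. It is not the algorithm as implemented in generate.lua: beam_search, toy_model, the probability table and the "<eos>" token are all hypothetical and chosen so that the greedy first step leads to a worse overall sequence.

```python
import math

# Toy next-token distributions keyed by the tokens generated so far.
TABLE = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"z": 1.0},
}


def toy_model(tokens):
    # Any prefix not in the table ends the sequence with certainty.
    probs = TABLE.get(tuple(tokens), {"<eos>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}


def beam_search(next_log_probs, beam_size, max_len, eos="<eos>"):
    beams = [(0.0, [])]  # (total log-probability, tokens so far)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, tokens in beams:
            for tok, lp in next_log_probs(tokens).items():
                candidates.append((score + lp, tokens + [tok]))
        # Keep only the beam_size highest-scoring hypotheses.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, tokens in candidates[:beam_size]:
            if tokens[-1] == eos:
                finished.append((score, tokens))
            else:
                beams.append((score, tokens))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    return max(finished, key=lambda c: c[0])


greedy_score, greedy_tokens = beam_search(toy_model, beam_size=1, max_len=5)
beam_score, beam_tokens = beam_search(toy_model, beam_size=2, max_len=5)
print(greedy_tokens)  # ['a', 'x', '<eos>'], total probability 0.3
print(beam_tokens)    # ['b', 'z', '<eos>'], total probability 0.4
```

With a beam of size 1 the search commits to "a" because it is the likeliest first token, while a beam of size 2 keeps "b" alive long enough to find the more probable full sequence.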

Download Extracted Features & Pretrained Models


Extracted features for v0.9 train and val are available for download.

Pretrained models

Trained on v0.9 train, results on v0.9 val.
Encoder Decoder CNN MRR R@1 R@5 R@10 MR Download
lf-ques gen VGG-16 0.5048 0.3974 0.6067 0.6649 17.8003 lf-ques-gen-vgg16-18
lf-ques-hist gen VGG-16 0.5099 0.4012 0.6155 0.6740 17.3974 lf-ques-hist-gen-vgg16-18
lf-ques-im gen VGG-16 0.5206 0.4206 0.6165 0.6760 17.0578 lf-ques-im-gen-vgg16-22
lf-ques-im-hist gen VGG-16 0.5146 0.4086 0.6205 0.6828 16.7553 lf-ques-im-hist-gen-vgg16-26
lf-att-ques-im-hist gen VGG-16 0.5354 0.4354 0.6355 0.6941 16.7663 lf-att-ques-im-hist-gen-vgg16-80
hre-ques-hist gen VGG-16 0.5089 0.4000 0.6154 0.6739 17.3618 hre-ques-hist-gen-vgg16-18
hre-ques-im-hist gen VGG-16 0.5237 0.4223 0.6228 0.6811 16.9669 hre-ques-im-hist-gen-vgg16-14
hrea-ques-im-hist gen VGG-16 0.5238 0.4213 0.6244 0.6842 16.6044 hrea-ques-im-hist-gen-vgg16-24
mn-ques-hist gen VGG-16 0.5131 0.4057 0.6176 0.6770 17.6253 mn-ques-hist-gen-vgg16-102
mn-ques-im-hist gen VGG-16 0.5258 0.4229 0.6274 0.6874 16.9871 mn-ques-im-hist-gen-vgg16-78
mn-att-ques-im-hist gen VGG-16 0.5341 0.4354 0.6318 0.6903 17.0726 mn-att-ques-im-hist-gen-vgg16-100
lf-ques disc VGG-16 0.5491 0.4113 0.7020 0.7964 7.1519 lf-ques-disc-vgg16-10
lf-ques-hist disc VGG-16 0.5724 0.4319 0.7308 0.8251 6.2847 lf-ques-hist-disc-vgg16-8
lf-ques-im disc VGG-16 0.5745 0.4331 0.7398 0.8340 5.9801 lf-ques-im-disc-vgg16-12
lf-ques-im-hist disc VGG-16 0.5911 0.4490 0.7563 0.8493 5.5493 lf-ques-im-hist-disc-vgg16-8
lf-att-ques-im-hist disc VGG-16 0.6079 0.4692 0.7731 0.8635 5.1965 lf-att-ques-im-hist-disc-vgg16-20
hre-ques-hist disc VGG-16 0.5668 0.4265 0.7245 0.8207 6.3701 hre-ques-hist-disc-vgg16-4
hre-ques-im-hist disc VGG-16 0.5818 0.4461 0.7373 0.8342 5.9647 hre-ques-im-hist-disc-vgg16-4
hrea-ques-im-hist disc VGG-16 0.5821 0.4456 0.7378 0.8341 5.9646 hrea-ques-im-hist-disc-vgg16-4
mn-ques-hist disc VGG-16 0.5831 0.4388 0.7507 0.8434 5.8090 mn-ques-hist-disc-vgg16-20
mn-ques-im-hist disc VGG-16 0.5971 0.4562 0.7627 0.8539 5.4218 mn-ques-im-hist-disc-vgg16-12
mn-att-ques-im-hist disc VGG-16 0.6082 0.4700 0.7724 0.8623 5.2930 mn-att-ques-im-hist-disc-vgg16-28


Extracted features for v1.0 train, val and test are available for download.

Pretrained models

Trained on v1.0 train + v1.0 val, results on v1.0 test. Leaderboard here.
Encoder Decoder CNN NDCG MRR R@1 R@5 R@10 MR Download
lf-ques-im-hist gen VGG-16 0.5121 0.4568 35.08 55.92 64.02 18.8140 lf-ques-im-hist-gen-vgg16-24
hre-ques-im-hist gen VGG-16 0.5245 0.4561 34.78 56.18 63.72 18.7778 hre-ques-im-hist-gen-vgg16-20
mn-ques-im-hist gen VGG-16 0.5280 0.4580 35.05 56.35 63.92 19.3128 mn-ques-im-hist-gen-vgg16-92
lf-att-ques-im-hist gen VGG-16 0.5362 0.4697 36.58 57.40 64.48 18.9550 lf-att-ques-im-hist-gen-vgg16-82
mn-att-ques-im-hist gen VGG-16 0.5367 0.4650 36.00 56.80 64.25 19.3470 mn-att-ques-im-hist-gen-vgg16-100
lf-ques-im-hist disc VGG-16 0.4531 0.5542 40.95 72.45 82.83 5.9532 lf-ques-im-hist-disc-vgg16-8
hre-ques-im-hist disc VGG-16 0.4546 0.5416 39.93 70.45 81.50 6.4082 hre-ques-im-hist-disc-vgg16-4
mn-ques-im-hist disc VGG-16 0.4750 0.5549 40.98 72.30 83.30 5.9245 mn-ques-im-hist-disc-vgg16-12
lf-att-ques-im-hist disc VGG-16 0.4976 0.5707 42.08 74.82 85.05 5.4092 lf-att-ques-im-hist-disc-vgg16-24
mn-att-ques-im-hist disc VGG-16 0.4958 0.5690 42.42 74.00 84.35 5.5852 mn-att-ques-im-hist-disc-vgg16-24


