Image QA

This repository contains code to reproduce the results in the paper "Exploring Models and Data for Image Question Answering" by Mengye Ren, Ryan Kiros, and Richard Zemel, NIPS 2015 (to appear).

Rendered results

Results for each model can be viewed directly at http://www.cs.toronto.edu/~mren/imageqa/results

Dataset

The COCO-QA dataset is released at http://www.cs.toronto.edu/~mren/imageqa/data/cocoqa

Prerequisites

Dependencies

Please install the following dependencies:

Repository structure

The repository contains the following folders:

Data files

Please download the following files from my server:

After downloading the files, place hidden_oxford_mscoco.h5 inside the data folder, and extract the cocoqa folder inside data.

Now your data folder should contain the following files:

All numpy files above (train, valid, test) store two objects: the input data and the target values. The input data is a 3-D array whose first dimension is the example, second dimension is the time step, and third dimension is the feature. The first time step holds the image ID, and the later time steps hold word IDs. The target value is the answer class ID. The ID dictionaries can be found in qdict.pkl and ansdict.pkl, which are Python pickle files storing the dictionary objects. All unseen words in the test set are encoded as 'UNK', which has its own ID. Note that word IDs are 1-based; 0 is reserved for the empty word, which has an all-zero word embedding vector.
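
Below is a minimal sketch of how this layout could be inspected in Python. The exact file names (train.npy, qdict.pkl, ansdict.pkl) and the allow_pickle flag are assumptions based on the description above; adjust them to match the contents of your data folder.

import pickle
import numpy as np

# Load one split; the file stores two objects: the input data and the target values.
# (The file name is an assumption; use whichever split file is in data/cocoqa.)
data = np.load('../data/cocoqa/train.npy', allow_pickle=True)
inputs, targets = data[0], data[1]

print(inputs.shape)   # (num examples, time, feature)
print(targets.shape)  # (num examples,) answer class IDs

# Look at one example: the first time step is the image ID,
# later time steps are word IDs (1-based, 0 = empty word).
example = inputs[0]
image_id = example[0, 0]
word_ids = [int(t[0]) for t in example[1:] if t[0] != 0]

# The word and answer dictionaries are plain Python pickles.
with open('../data/cocoqa/qdict.pkl', 'rb') as f:
    qdict = pickle.load(f)
with open('../data/cocoqa/ansdict.pkl', 'rb') as f:
    ansdict = pickle.load(f)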

Training

After setting up the dataset, run the following command to train a model. For IMG+BOW, {model file} is models/img_bow.model.yml. Model files for VIS+LSTM and 2-VIS+BLSTM can also be found in the models folder.

cd src

GNUMPY_USE_GPU={yes|no} python train.py \
-model ../models/{model file} \
-output ../results \
-data ../data/cocoqa \
-config ../config/train.yml \
[-board {gpu board id} (optional)]
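
For example, to train the IMG+BOW model on the CPU:

cd src

GNUMPY_USE_GPU=no python train.py \
-model ../models/img_bow.model.yml \
-output ../results \
-data ../data/cocoqa \
-config ../config/train.yml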

While training, the script prints status messages; here is how to decode them:

In the first round, the script trains on the training set only and validates on the hold-out set to determine the number of epochs to train. It then starts another job that trains on the training set plus the hold-out set together. Test set performance is not printed until everything has finished.

Reading trained weight matrices

The weights are stored in the results folder as {model}-{timestamp}/{model}-{timestamp}.w.npy.

If you load the weights in Python, you get a list of arrays. Non-parameterized layers appear as a single 0 value in the list. For the IMG+BOW model there are only two non-zero entries: one is the word embedding matrix and the other is the softmax weights. The last row of the softmax weights is the bias.
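
As a rough illustration, the following sketch loads the weights of an IMG+BOW run and separates the two non-zero entries. The folder name and the ordering of the embedding and softmax entries in the list are assumptions; check them against your own results folder.

import numpy as np

# The path is hypothetical; substitute the actual {model}-{timestamp} folder name.
w = np.load('../results/img_bow-20150101-120000/img_bow-20150101-120000.w.npy',
            allow_pickle=True)

# w is a list of arrays; non-parameterized layers appear as a single 0.
param_entries = [x for x in w if hasattr(x, 'ndim') and x.ndim >= 1]

# For IMG+BOW there are only two such entries; the order below
# (embedding first, then softmax) is an assumption.
word_embedding, softmax = param_entries

# The last row of the softmax weights is the bias.
softmax_weight, softmax_bias = softmax[:-1], softmax[-1]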

For LSTM weights, the weights for the entire LSTM unit are reshaped into one matrix,

W = [W_I, W_F, W_Z, W_O]

where W_I is for the input gate, W_F is for the forget gate, W_Z is for the input transformation, and W_O is for the output gate. Each W has the bias as its last row, i.e. each block is (InDim + 1) x OutDim.
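
Below is a small sketch of how the combined matrix could be split back into the four gates. It assumes the blocks are concatenated along the columns in the order above, so that W has shape (InDim + 1) x (4 * OutDim); if the saved layout differs, adjust the split accordingly.

import numpy as np

def split_lstm_weights(W):
    # Assumes W has shape (InDim + 1, 4 * OutDim), with blocks ordered I, F, Z, O.
    blocks = np.split(W, 4, axis=1)
    gates = {}
    for name, block in zip(('I', 'F', 'Z', 'O'), blocks):
        # In each block, the last row is the bias.
        gates[name] = {'weight': block[:-1], 'bias': block[-1]}
    return gates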