ActiveQA: Active Question Answering

This repo contains code for our paper Ask the Right Questions: Active Question Reformulation with Reinforcement Learning.

Small forewarning, this is still much more of a research codebase than a library. No support is provided.

If you use this code for your research, please cite the paper.

Introduction

ActiveQA is an agent that transforms questions online in order to find the best answers. The agent consists of a Tensorflow model that reformulates questions and an Answer Selection model. It interacts with an environment that contains a question-answering system. The agent queries the environment with variants of a question and calculates a score for the answer against the original question. The model is trained end-to-end using reinforcement learning.

This version addresses the SearchQA question-answering task, and the environment consists of the Bi-directional Attention Flow (BiDAF) model of Seo et al. (2017).

Setup

Dependencies

We require tensorflow and many other supporting libraries. Tensorflow should be installed separately following the docs. To install the other dependencies use

pip install -r requirements.txt

Note: We only ran this code with Python 2, so Python 3 is not officially supported.

Data

Download the source dataset from SearchQA, GloVe, and NLTK corpus and save them in $HOME/data.

export DATA_DIR=$HOME/data
mkdir $DATA_DIR

Download

Download the SearchQA dataset (~600 MB) for training, testing, and validation here: https://drive.google.com/open?id=1OxRhw81g7amW3aBd_iu2By5THysgr2uv

<Download the dataset to $DATA_DIR/SearchQA.zip>
unzip $DATA_DIR/SearchQA.zip -d $DATA_DIR

Download GloVe (~850 MB):

export GLOVE_DIR=$DATA_DIR/glove
mkdir $GLOVE_DIR

wget -c http://nlp.stanford.edu/data/glove.6B.zip -O $GLOVE_DIR/glove.6B.zip
unzip $GLOVE_DIR/glove.6B.zip -d $GLOVE_DIR

Download NLTK (for tokenizer). Make sure that nltk is installed!

python -m nltk.downloader -d $HOME/nltk_data punkt

Download the reformulator model pretrained on UN+Paralex datasets (~140 MB):

export PRETRAINED_DIR=$DATA_DIR/pretrained
mkdir $PRETRAINED_DIR

wget -c https://storage.googleapis.com/pretrained_models/translate.ckpt-1460356.zip -O $PRETRAINED_DIR/translate.ckpt-1460356.zip
unzip $PRETRAINED_DIR/translate.ckpt-1460356.zip -d $PRETRAINED_DIR

Preprocess

The SearchQA dataset requires a 2-step preprocessing:

Convert into SQuAD data format as the model was written to only work with that format.

export SQUAD_DIR=$DATA_DIR/squad
mkdir $SQUAD_DIR

python -m searchqa.prepro \
--searchqa_dir=$DATA_DIR/SearchQA \
--squad_dir=$SQUAD_DIR

Preprocess the SearchQA dataset in SQuAD format (along with GloVe vectors) and save them in $PWD/data/squad (~60 minutes):
```
python -m third_party.bi_att_flow.squad.prepro \
--glove_dir=$GLOVE_DIR \
--source_dir=$SQUAD_DIR
```

Note that Python2 and Python3 handle Unicode differently and hence the preprocessing output differs. For converting the SearchQA format to SQuAD format either version can be used; use Python3 for other datasets.

gRPC

We need to compile the gRPC interface for the Environment Server.

chmod +x compile_protos.sh; ./compile_protos.sh

Run Environment Server

The training requires running the environment gRPC server, which receives queries from the ActiveQA agent and sends back one response per query.

   python -m px.environments.bidaf_server \
   --port=10000 \
   --squad_data_dir=data/squad \
   --bidaf_shared_file=data/bidaf/shared.json \
   --bidaf_model_dir=data/bidaf/

The checkpoint of a BiDAF model trained on SearchQA is already provided in data/bidaf, so you don't have to train one yourself. However, if you want to reproduce our training, clone the BiDAF repository and run

python basic/cli.py \
--mode=trains \
--data_dir=data/squad \
--shared_path=data/bidaf/shared.json \
--init_lr=0.001 \
--num_steps=14000

Reformulator Training

We first train reformulator from a model pretrained on UN and Paralex datasets. It should take a week on a single P100 GPU to reach ~42 F1 score on SearchQA's dev set.

export OUT_DIR=/tmp/active-qa
mkdir $OUT_DIR

export REFORMULATOR_DIR=$OUT_DIR/reformulator
mkdir $REFORMULATOR_DIR

echo "model_checkpoint_path: \"$PRETRAINED_DIR/translate.ckpt-1460356\"" > checkpoint
cp -f checkpoint $REFORMULATOR_DIR
cp -f checkpoint $REFORMULATOR_DIR/initial_checkpoint.txt

python -m px.nmt.reformulator_and_selector_training \
--environment_server_address=localhost:10000 \
--hparams_path=px/nmt/example_configs/reformulator.json \
--enable_reformulator_training=true \
--enable_selector_training=false \
--train_questions=$SQUAD_DIR/train-questions.txt \
--train_annotations=$SQUAD_DIR/train-annotation.txt \
--train_data=data/squad/data_train.json \
--dev_questions=$SQUAD_DIR/dev-questions.txt \
--dev_annotations=$SQUAD_DIR/dev-annotation.txt \
--dev_data=data/squad/data_dev.json \
--glove_path=$GLOVE_DIR/glove.6B.100d.txt \
--out_dir=$REFORMULATOR_DIR \
--tensorboard_dir=$OUT_DIR/tensorboard

Note: if you don't want to wait a week of training, you can download this checkpoint of the reformulator trained on SearchQA, with dev set F1 score of 42.5. Note that this is not the exact model analyzed in the paper, but one with equivalent performance.

Selector Training

After training the reformulator, we can now train the selector. It should take 2-3 days on a single P100 GPU to reach ~47.5 F1 score on SearchQA's dev set.

python -m px.nmt.reformulator_and_selector_training \
--environment_server_address=localhost:10000 \
--hparams_path=px/nmt/example_configs/reformulator.json \
--enable_reformulator_training=false \
--enable_selector_training=true \
--train_questions=$SQUAD_DIR/train-questions.txt \
--train_annotations=$SQUAD_DIR/train-annotation.txt \
--train_data=data/squad/data_train.json \
--dev_questions=$SQUAD_DIR/dev-questions.txt \
--dev_annotations=$SQUAD_DIR/dev-annotation.txt \
--dev_data=data/squad/data_dev.json \
--glove_path=$GLOVE_DIR/glove.6B.100d.txt \
--batch_size_train=16 \
--batch_size_eval=64 \
--save_path=$OUT_DIR/selector \
--out_dir=$REFORMULATOR_DIR \
--tensorboard_dir=$OUT_DIR/tensorboard

Note: If you don't want to wait 2-3 days for the training to finish, you can download a checkpoint of the selector. The checkpoint is trained on SearchQA, achieving an F1 score of ~47.5 on the dev set.

References

This repository relies on the work of the following repositories:

and uses data from the following sources:

BibTex

@inproceedings{buck18,
  author    = {Christian Buck and
               Jannis Bulian and
               Massimiliano Ciaramita and
               Andrea Gesmundo and
               Neil Houlsby and
               Wojciech Gajewski and
               Wei Wang},
  title     = {Ask the Right Questions: Active Question Reformulation with Reinforcement
               Learning},
  booktitle = {Sixth International Conference on Learning Representations (ICLR)},
  year      = {2018},
  month     = {May},
  address   = {Vancouver, Canada},
  url       = {https://openreview.net/forum?id=S1CChZ-CZ},
}