Learning Visually-Grounded Semantics from Contrastive Adversarial Samples

Developed by Freda Shi and Jiayuan Mao.

This repo includes the implementation of our COLING 2018 paper "Learning Visually-Grounded Semantics from Contrastive Adversarial Samples".

An overview of our approach is illustrated in intro.pdf.

Requirements

VSE++

We use VSE++ (Faghri et al., 2017) as our base model. To reproduce the baseline numbers of VSE++, please follow the authors' instructions here; we found their results easy to reproduce.

Datasets

We use the same datasets as VSE++ (Faghri et al., 2017). Use the following commands to download the VSE++ data to the root folder

wget http://www.cs.toronto.edu/~faghri/vsepp/vocab.tar
wget http://www.cs.toronto.edu/~faghri/vsepp/data.tar

and extract the tar archives with

tar -xvf vocab.tar
tar -xvf data.tar
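
After extraction, you can sanity-check the layout with a short Python snippet. The file names below follow the VSE++ precomputed-feature convention and are our assumption, not something prescribed by this repo:

import os

# Check that the archives unpacked where the training scripts expect them.
# These file names follow the VSE++ precomp convention (assumed here).
for path in ['data/coco_precomp/train_caps.txt',
             'data/coco_precomp/train_ims.npy',
             'vocab/coco_precomp_vocab.pkl']:
    print(path, 'OK' if os.path.exists(path) else 'MISSING')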

You may also need GloVe, as we use GloVe.840B.300d to initialize the word embeddings. We also provide a custom subset of the GloVe embeddings at VSE_C/data/glove.pkl.
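
If you want to inspect the bundled subset, here is a minimal sketch; we assume glove.pkl unpickles to a word-to-vector mapping, so adjust if the actual layout differs:

import pickle

import numpy as np

# Load the bundled GloVe subset; assumed to be a dict mapping words
# to 300-d vectors, matching GloVe.840B.300d.
with open('VSE_C/data/glove.pkl', 'rb') as f:
    glove = pickle.load(f)

vector = np.asarray(glove['dog'])  # 'dog' is an arbitrary probe word
print(vector.shape)  # expected: (300,)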

Reproduce our Experiments

Generate Contrastive Adversarial Samples

The following command generates a specific type of contrastive adversarial samples for sentences. Note that the script will create folders under the given data path, e.g., ../data/coco_precomp/noun_ex/.

cd adversarial_attack
python3 $TYPE.py --data_path $DATA_PATH --data_name $DATA_NAME

$TYPE can be one of noun, numeral or relation. Here is an example command:

python3 noun.py --data_path ../data --data_name coco_precomp
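
To generate all three types in one pass, a simple shell loop over the command above should work:

for TYPE in noun numeral relation; do
    python3 ${TYPE}.py --data_path ../data --data_name coco_precomp
done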

Train VSE-C

Like VSE++, VSE-C supports training with contrastive adversarial samples in the text domain. After generating the noun-typed contrastive adversarial samples, we can train a noun-typed VSE-C with the following command:

cd VSE_C
python3 train.py --data_path ../data/ --data_name coco_precomp \
    --logger_name runs/coco_noun --learning_rate 0.001 --text_encoder_type gru \
    --max_violation --worker 10 --img_dim 2048 --use_external_captions

The model will be saved in the logger folder, e.g., runs/coco_noun for the command above. Please refer to VSE_C/train.py for a more detailed description of the hyper-parameters. Note that you may also need to create the logger folder in advance.
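
If VSE-C keeps VSE++'s checkpointing convention (an assumption on our part), the best model is written to model_best.pth.tar inside the logger folder and can be inspected like this:

import torch

# Load the best checkpoint on CPU; the file name and the 'epoch' /
# 'best_rsum' keys follow VSE++'s convention and are assumed here.
checkpoint = torch.load('runs/coco_noun/model_best.pth.tar',
                        map_location='cpu')
print(checkpoint['epoch'], checkpoint['best_rsum'])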

We have tested the model on GPUs (CUDA 8.0). If you have any problem training VSE-C in a different environment, please feel free to open an issue.

Evaluate VSE-C

In-Domain Evaluation

Here is an example of in-domain evaluation. Please run the code in Python 3 or IPython 3.

from VSE_C.vocab import Vocabulary  # required so the pickled vocabulary can be deserialized
from VSE_C import evaluation

evaluation.eval_with_single_extended('runs/coco_noun', 'data/', 'coco_precomp', 'test')

Object Alignment

We provide our training script (testing is included in the evaluation procedure) at evaluation/object_alignment. Please refer to the script for detailed usage.

Saliency Visualization (Jacinle required)

evaluation/saliency_visualization/saliency_visualization.py provides the script for saliency visualization. Please refer to the script for detailed usage. The script outputs visualized saliency images.

Sentence Completion (Fill-in-the-Blanks, Jacinle required)

First, we need to generate the datasets for sentence completion by running

cd evaluations/completion
python3 completion_datamaker.py --input $INPUT_PATH --output $OUTPUT_PATH
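
For example, with the COCO data downloaded above (the output path here is a hypothetical choice on our part; only --input and --output come from the script's interface):

python3 completion_datamaker.py --input ../../data/coco_precomp --output ../../data/coco_precomp/completion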

Then, run

python3 -m evaluations.completion.completion_train $ARGS
python3 -m evaluations.completion.completion_test $ARGS

for training and testing sentence completion models. Please refer to the evaluation scripts for further description of the arguments.

Reference

If you find VSE-C useful, please consider citing:

@inproceedings{shi2018learning,
    title={Learning Visually-Grounded Semantics from Contrastive Adversarial Samples},
    author={Shi, Haoyue and Mao, Jiayuan and Xiao, Tete and Jiang, Yuning and Sun, Jian},
    booktitle={Proceedings of the 27th International Conference on Computational Linguistics},
    year={2018}
}

License

MIT