Cross-Modal-Projection-Learning

TensorFlow implementation of Deep Cross-Modal Projection Learning for Image-Text Matching, accepted by ECCV 2018.

Introduction

We propose a cross-modal projection matching (CMPM) loss and a cross-modal projection classification (CMPC) loss for learning discriminative image-text embeddings.
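To make the matching idea concrete, here is a minimal NumPy sketch of a CMPM-style loss, written from the formulation in the paper rather than taken from this repository's TensorFlow code. It assumes a batch of image and text embeddings with identity labels, projects each embedding onto the L2-normalized embeddings of the other modality, and penalizes the KL divergence between the resulting softmax distribution and the normalized true matching distribution. The function name `cmpm_loss` and the constant `eps` are illustrative choices.

```python
import numpy as np

def cmpm_loss(image_emb, text_emb, labels, eps=1e-8):
    """Illustrative CMPM-style loss for one batch (hypothetical helper)."""
    image_emb = np.asarray(image_emb, dtype=np.float64)
    text_emb = np.asarray(text_emb, dtype=np.float64)
    labels = np.asarray(labels)

    # y[i, j] = 1 when sample i and sample j share the same identity label
    y = (labels[:, None] == labels[None, :]).astype(np.float64)
    # True matching distribution: each row of y normalized to sum to 1
    q = y / y.sum(axis=1, keepdims=True)

    def one_direction(a, b):
        # Scalar projection of each row of a onto each L2-normalized row of b
        b_norm = b / (np.linalg.norm(b, axis=1, keepdims=True) + eps)
        logits = a @ b_norm.T
        # Row-wise softmax, shifted by the row maximum for numerical stability
        logits = logits - logits.max(axis=1, keepdims=True)
        p = np.exp(logits)
        p = p / p.sum(axis=1, keepdims=True)
        # KL(p || q) per row, averaged over the batch
        kl = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=1)
        return kl.mean()

    # Sum of the image-to-text and text-to-image terms
    return one_direction(image_emb, text_emb) + one_direction(text_emb, image_emb)

# Toy batch: two pairs with distinct labels
img = np.array([[10.0, 0.0], [0.0, 10.0]])
txt = np.array([[1.0, 0.0], [0.0, 1.0]])
aligned = cmpm_loss(img, txt, [0, 1])
misaligned = cmpm_loss(img, txt[::-1], [0, 1])
print(aligned < misaligned)
```

In this toy batch the correctly paired embeddings give a lower loss than the swapped ones, which is the behavior the matching loss is meant to encourage.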

Requirements

TensorFlow 1.4.0
CUDA 8.0 and cuDNN 6.0
Python 2.7

Usage

Data Preparation

Please download the Flickr30k dataset (about 4.4 GB)
Please download the JSON annotations
Convert the Flickr30k image-text data into TFRecords (about 15 GB)
```
cd builddata && sh scripts/format_and_convert_flickr.sh 0
```

Training

Please download the pretrained ResNet-v1-152 checkpoint
Train CMPM with ResNet-152 + Bi-LSTM on Flickr30k
```
sh scripts/train_flickr_cmpm.sh 0
```
Train CMPM + CMPC with ResNet-152 + Bi-LSTM on Flickr30k
```
sh scripts/train_flickr_cmpm_cmpc.sh 0
```

Testing

Compute R@K (K=1,5,10) for image-to-text and text-to-image retrieval on Flickr30k
```
sh scripts/test_flickr_cmpm.sh 0
```
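For readers unfamiliar with the metric, the following NumPy sketch shows one common way to compute R@K from a query-by-gallery similarity matrix. It is not the evaluation code run by the script above. It assumes a query counts as a hit when at least one gallery item with the same label appears among its top K results, and the function name `recall_at_k` and its arguments are hypothetical.

```python
import numpy as np

def recall_at_k(similarity, query_labels, gallery_labels, ks=(1, 5, 10)):
    """Fraction of queries with at least one correct match in the top-k results."""
    similarity = np.asarray(similarity)
    query_labels = np.asarray(query_labels)
    gallery_labels = np.asarray(gallery_labels)
    # Rank gallery items for each query from most to least similar
    ranking = np.argsort(-similarity, axis=1)
    # hits[i, r] is True when the gallery item at rank r matches query i
    hits = gallery_labels[ranking] == query_labels[:, None]
    return {k: float(hits[:, :k].any(axis=1).mean()) for k in ks}

# Toy example: 3 image queries against 3 text gallery items
sim = np.array([[0.9, 0.1, 0.2],
                [0.3, 0.2, 0.8],
                [0.1, 0.7, 0.6]])
labels = np.array([0, 1, 2])
i2t = recall_at_k(sim, labels, labels, ks=(1, 2, 3))    # image-to-text
t2i = recall_at_k(sim.T, labels, labels, ks=(1, 2, 3))  # text-to-image
print(i2t, t2i)
```

Transposing the similarity matrix and swapping the label arguments gives the other retrieval direction.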

More Datasets

We also provide code for MSCOCO and CUHK-PEDES, which follow preparation, training, and testing procedures similar to those for Flickr30k.
Be careful with disk space: MSCOCO may require 20.1 GB for images and 77.6 GB for TFRecords.
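A quick way to check free space before converting a large dataset is sketched below. It is a standalone helper and not part of this repository. It relies on `shutil.disk_usage`, which needs Python 3.3 or later, so run it separately from the Python 2.7 environment listed in the requirements. The helper name `has_enough_space` is hypothetical.

```python
import shutil

def has_enough_space(path, required_gb):
    """Return True if the filesystem holding `path` has at least `required_gb` GB free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb

# 20.1 GB of images plus 77.6 GB of TFRecords, per the MSCOCO figures above
MSCOCO_REQUIRED_GB = 20.1 + 77.6
print(has_enough_space(".", MSCOCO_REQUIRED_GB))
```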

Citation

If you find CMPL useful in your research, please kindly cite our paper:

@inproceedings{ying2018CMPM,
    author = {Ying Zhang and Huchuan Lu},
    title = {Deep Cross-Modal Projection Learning for Image-Text Matching},
    booktitle = {ECCV},
    year = {2018}}

Contact

If you have any questions, please feel free to contact zydl0907@mail.dlut.edu.cn