Image Captioning

Introduction

Build a model that generates captions from images: given an image, the model describes in English what the image contains. To achieve this, the model comprises a CNN encoder and an RNN decoder. The CNN encoder, trained on an image classification task, extracts features from the input image; these features are fed into the RNN decoder, which outputs an English sentence.
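The encoder-decoder wiring can be sketched in PyTorch. This is a minimal, self-contained stand-in, not the project code: the toy `EncoderCNN` replaces the pretrained CNN, and `embed_size`, `hidden_size`, and `vocab_size` are illustrative values. It shows how the image feature occupies the first time step of the decoder:

```python
import torch
import torch.nn as nn

class EncoderCNN(nn.Module):
    """Toy CNN encoder: maps an image to a fixed-size feature vector.
    (The project uses a pretrained CNN; this stand-in keeps the sketch runnable.)"""
    def __init__(self, embed_size=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # collapse spatial dims to 1x1
        )
        self.fc = nn.Linear(32, embed_size)

    def forward(self, images):
        x = self.conv(images).flatten(1)
        return self.fc(x)

class DecoderRNN(nn.Module):
    """LSTM decoder: the image feature is the first input, then word embeddings."""
    def __init__(self, embed_size=256, hidden_size=512, vocab_size=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # Drop the last caption token; the image feature takes the first time step.
        inputs = torch.cat([features.unsqueeze(1), self.embed(captions[:, :-1])], dim=1)
        hiddens, _ = self.lstm(inputs)
        return self.fc(hiddens)  # per-step scores over the vocabulary

images = torch.randn(4, 3, 224, 224)            # a batch of 4 fake images
captions = torch.randint(0, 1000, (4, 12))      # 4 fake captions, 12 tokens each
features = EncoderCNN()(images)
scores = DecoderRNN()(features, captions)
print(scores.shape)  # torch.Size([4, 12, 1000])
```

During training the scores are compared against the ground-truth captions with a cross-entropy loss; at inference the decoder is instead run one token at a time, feeding each predicted word back in.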

The model and the tuning of its hyperparameters are based on ideas presented in the papers Show and Tell: A Neural Image Caption Generator and Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.

We use the Microsoft Common Objects in COntext (MS COCO) dataset for this project. It is a large-scale dataset for scene understanding. The dataset is commonly used to train and benchmark object detection, segmentation, and captioning algorithms. For instructions on downloading the data, see the Data section below.
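MS COCO stores captions as JSON: an `images` list and an `annotations` list, where each annotation links one caption to an image via `image_id` (each image has several captions). A stdlib-only sketch of that structure, using made-up entries rather than the real annotation files:

```python
import json
from collections import defaultdict

# A miniature stand-in for a COCO captions file (e.g. captions_train2014.json);
# the entries below are fabricated for illustration.
coco_style = {
    "images": [
        {"id": 9, "file_name": "COCO_train2014_000000000009.jpg"},
    ],
    "annotations": [
        {"id": 1, "image_id": 9, "caption": "A plate of food on a table."},
        {"id": 2, "image_id": 9, "caption": "A close-up view of a meal."},
    ],
}

# Round-trip through JSON, then index captions by image id --
# essentially what the COCO API's index provides.
data = json.loads(json.dumps(coco_style))
captions_by_image = defaultdict(list)
for ann in data["annotations"]:
    captions_by_image[ann["image_id"]].append(ann["caption"])

print(captions_by_image[9][0])  # A plate of food on a table.
```

In practice the notebooks read these files through the COCO API rather than raw `json`, but the underlying structure is the same.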

Code

The code can be categorized into two groups:

1) Notebooks - The main code for the project is structured as a series of Jupyter notebooks:

2) Helper files - Contain helper code for the notebooks:

Setup

  1. Clone the COCO API repo into this project's directory:

    git clone https://github.com/cocodataset/cocoapi.git
  2. Set up the COCO API (also described in the readme here):

    cd cocoapi/PythonAPI
    make
    cd ..
  3. Install PyTorch (version 0.4.0 recommended) and torchvision.

    • Linux or Mac:
      conda install pytorch torchvision -c pytorch 
    • Windows:
      conda install -c peterjc123 pytorch-cpu
      pip install torchvision
  4. Install other dependencies:

Data

Download the following files from the COCO website and place them, as instructed below, into the cocoapi subdirectory inside this project's directory (this subdirectory was created when you cloned the COCO API repo in the Setup section above):
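For reference, a typical layout after downloading might look like the following. This is an assumed arrangement based on the standard COCO 2014 release; adjust the split names to whatever the notebooks expect:

```
cocoapi/
├── annotations/
│   ├── captions_train2014.json
│   ├── captions_val2014.json
│   └── image_info_test2014.json
└── images/
    ├── train2014/
    ├── val2014/
    └── test2014/
```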

Run

To run any script file, use:

python <script.py>

To run any Jupyter notebook, use:

jupyter notebook <notebook_name.ipynb>