object-localization

This project shows how to localize objects in images by using simple convolutional neural networks.

Dataset

Before getting started, we have to download a dataset and generate a csv file containing the annotations (boxes).

  1. Download The Oxford-IIIT Pet Dataset
  2. Download The Oxford-IIIT Pet Dataset Annotations
  3. tar xf images.tar.gz
  4. tar xf annotations.tar.gz
  5. mv annotations/xmls/* images/
  6. python3 generate_dataset.py

Single-object detection

Example 1: Finding dogs/cats

Architecture

First, let's look at YOLOv2's approach:

  1. Pretrain Darknet-19 on ImageNet (feature extractor)
  2. Remove the last convolutional layer
  3. Add three 3 x 3 convolutional layers with 1024 filters
  4. Add a 1 x 1 convolutional layer with the number of outputs needed for detection

We proceed in the same way to build the object detector:

  1. Choose a model from Keras Applications i.e. feature extractor
  2. Remove the dense layer
  3. Freeze some/all/no layers
  4. Add one/multiple/no convolution block (or _inverted_res_block for MobileNetv2)
  5. Add a convolution layer for the coordinates

The code in this repository uses MobileNetv2, because it is faster than other models and the performance can be adapted. For example, if alpha = 0.35 with 96x96 is not good enough, one can just increase both values (see here for a comparison). If you use another architecture, change preprocess_input.

  1. python3 example_1/train.py
  2. Adjust the WEIGHTS_FILE in example_1/test.py (given by the last script)
  3. python3 example_1/test.py

Result

In the following images red is the predicted box, green is the ground truth:

Image 1

Image 2

Example 2: Finding dogs/cats and distinguishing classes

This time we have to run the scripts example_2/train.py and example_2/test.py.

Changes

In order to distinguish between classes, we have to modify the loss function. I'm using here w_1*log((y_hat - y)^2 + 1) + w_2*FL(p_hat, p) where w_1 = w_2 = 1 are two weights and FL(p_hat, p) = -(0.9(1 - p_hat)^2 p*log(p_hat) + 0.1*p_hat^2(1 - p)log(1-p_hat)) (focal loss).

Instead of using all 37 classes, the code will only output class 0 (contains only class 0) or class 1 (contains class 1 to 36). However, it is easy to extend this to more classes (use categorical cross entropy instead of focal loss and try out different weights).

Multi-object detection

Example 3: Segmentation-like detection

Architecture

In this example, we use a skip-net architecture similar to U-Net. For an in-depth explanation see my blog post.

Architecture

Result

Dog

Example 4: YOLO-like detection

Architecture

This example is based on the three YOLO papers. For an in-depth explanation see this blog post.

Result

Multiple dogs

Guidelines

Improve accuracy (IoU)

Increase training speed

Overfitting