Tutorial about 3D convolutional network

For me, Artificial Intelligence is a passion, and I try to use it to solve everyday problems. In this tutorial/project, I want to give readers some intuition about how 3D convolutional neural networks actually work. There are not many tutorials about 3D convolutional neural networks, and few of them investigate the logic behind these networks.

"Standard" convolutional network

Before diving into 3D CNNs, let's summarize together what we know about ConvNets. ConvNets consist mainly of two parts:

CNN

In order to extract such features, ConvNets use 2D convolution operations.

conv2D
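To make the 2D case concrete, here is a minimal sketch in PyTorch (the channel counts and image size are illustrative, not taken from this project):

```python
import torch
import torch.nn as nn

# A single 2D convolution over a batch of RGB images.
# Input shape: [batch, channels, height, width]
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

x = torch.randn(8, 3, 64, 64)  # 8 RGB images of size 64x64
y = conv(x)
print(y.shape)  # torch.Size([8, 16, 64, 64])
```

With `padding=1` and a 3x3 kernel, the spatial size is preserved; only the channel dimension changes.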

Why do we need a 3D CNN?

Traditionally, ConvNets target RGB images (3 channels). The goal of a 3D CNN is to take a video as input and extract features from it. While ConvNets extract the graphical characteristics of a single image and put them in a vector (a low-level representation), 3D CNNs extract the graphical characteristics of a set of images. 3D CNNs take into account a temporal dimension (the order of the images in the video). From a set of images, a 3D CNN finds a low-level representation of the whole set, and this representation is useful for finding the right label of the video (i.e., which action is performed).

In order to extract such features, 3D CNNs use 3D convolution operations.

conv3D
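The only structural change from the 2D case is the extra temporal dimension: the kernel now also slides over the frame axis. A minimal sketch (the clip length and resolution below are illustrative):

```python
import torch
import torch.nn as nn

# 3D convolution slides the kernel over time (frames) as well as space.
# Input shape: [batch, channels, frames, height, width]
conv3d = nn.Conv3d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

video = torch.randn(4, 3, 18, 84, 84)  # 4 clips of 18 RGB frames, 84x84 each
features = conv3d(video)
print(features.shape)  # torch.Size([4, 16, 18, 84, 84])
```

Note that one 3D kernel mixes information across neighboring frames, which is how motion information enters the features.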

Is 3D CNN the only solution to video classification?

There are several existing approaches to video classification. This is a non-exhaustive list:

What is the main drawback of 3D CNN?

From my point of view, there are two main drawbacks compared to the other methods:

The dataset

For this tutorial, we will use the 20BN-jester Dataset V1. The 20BN-JESTER dataset is a large collection of densely-labeled video clips that show humans performing pre-defined hand gestures in front of a laptop camera or webcam. You can download it on their website.

After downloading all parts, extract them using:

cat 20bn-jester-v1-?? | tar zx

Before starting to implement any network, I advise you to take a quick look at the videos (sets of images). It is essential that you understand what you need to predict.

The network

As in a traditional 2D ConvNet, we use a set of convolution and max pooling operations to reduce, layer after layer, the size of our input data. In this tutorial, the shape of our input is [number of training examples in a batch, number of channels, number of images in the video, height, width].

The goal is to slowly reduce the graphical dimensions (height and width) and the temporal dimension (number of images in the video) until you get a vector of shape [number of training examples in a batch, number of channels, 1, 1, 1]. After that, you send this vector to a multi-layer classifier.
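The shrink-then-classify idea can be sketched as follows. This is not the network used in this repository, just a minimal illustration of the pattern; the layer widths are made up, and `num_classes=27` assumes the 27 gesture classes of 20BN-jester:

```python
import torch
import torch.nn as nn

class Tiny3DCNN(nn.Module):
    """Illustrative only: Conv3d + MaxPool3d blocks shrink the temporal
    and spatial dimensions, then an adaptive pool forces the feature map
    down to [N, C, 1, 1, 1] before the multi-layer classifier."""

    def __init__(self, num_classes=27):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),                 # halves frames, height, width
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.AdaptiveAvgPool3d(1),         # -> [N, 64, 1, 1, 1]
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                    # -> [N, 64]
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# [batch, channels, frames, height, width] -> [batch, num_classes]
logits = Tiny3DCNN()(torch.randn(2, 3, 16, 64, 64))
print(logits.shape)  # torch.Size([2, 27])
```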

Overfitting

In order to reduce overfitting, I used 3 techniques:

Optimizer

In this tutorial, we use the Adam optimizer with the AMSGrad variant described in this paper. Empirically, it tends to improve the convergence of my model. However, the reader should keep in mind that other experiments (from other users) have shown that AMSGrad can also lead to worse performance than plain Adam.
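In PyTorch, AMSGrad is enabled through a flag on the Adam optimizer, so switching between the two variants is a one-argument change (the model and learning rate below are placeholders):

```python
import torch

model = torch.nn.Linear(10, 2)  # stand-in for the 3D CNN

# amsgrad=True turns on the AMSGrad variant of Adam
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, amsgrad=True)

# one dummy training step
loss = model(torch.randn(4, 10)).sum()
loss.backward()
optimizer.step()
print(optimizer.defaults["amsgrad"])  # True
```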

Visualization tool

For visualizing the results, I like to use TensorBoard. Since TensorBoard is not available by default in PyTorch, I used tensorboardX, which enables PyTorch to log to TensorBoard.

How to run this code?

First, you have to set up the config file accordingly (define the right paths, the name of your model, ...). Then you just need to run:

python train.py --config configs/config.json -g 0

For the prediction, run:

python train.py --eval_only True --resume True --config configs/config.json -g 0

-g: the ID of the GPU you are using. If you are using multiple GPUs, you can write:

python train.py --config configs/config.json -g 0,1

Note: this code works well with Python 3.6 and PyTorch 0.4.

Results

For this network, I got an accuracy of 85% on the validation set. Since there is an accuracy gap between the training set and the validation set, there is plenty of room for improvement.

Below you can find the training curves obtained via tensorboard.

Training

training_loss training_top1 training_top5

Validation

training_loss training_top1 training_top5

Better accuracy

An accuracy of 85% on the validation set is nice. However, can we do better?

First, we can observe that the accuracy on the training set is close to 100%, while the accuracy on the validation set plateaus at 85%. This gap of 15 percentage points can be explained by the fact that our model tends to overfit the training dataset. What can you do about that?

Several things can be done:

Ensemble learning

Let's take a simple example. We have 3 binary classifiers (A, B, C), each with 70% accuracy. A, B, and C can be based on the 3D convolution architecture mentioned previously.

For a majority vote with 3 members, we can expect 4 outcomes:

All three are correct: 0.7 × 0.7 × 0.7 = 0.343

Two are correct, one is wrong: 3 × (0.7 × 0.7 × 0.3) = 0.441

Two are wrong, one is correct: 3 × (0.3 × 0.3 × 0.7) = 0.189

All three are wrong: 0.3 × 0.3 × 0.3 = 0.027

We see that a large share of the time (~44%) the majority vote corrects a single member's error. This majority-vote ensemble will be correct on average ~78% of the time (0.343 + 0.441 = 0.784).
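The same calculation can be checked by enumerating all vote combinations; this short script is self-contained and independent of the models:

```python
from itertools import product

p = 0.7  # accuracy of each individual classifier

# Probability that exactly k of the 3 classifiers are correct
outcomes = {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0}
for votes in product([True, False], repeat=3):
    prob = 1.0
    for correct in votes:
        prob *= p if correct else (1 - p)
    outcomes[sum(votes)] += prob

# Majority vote is correct when at least 2 of 3 are correct
ensemble_accuracy = outcomes[3] + outcomes[2]
print(round(outcomes[3], 3), round(outcomes[2], 3), round(ensemble_accuracy, 3))
# 0.343 0.441 0.784
```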

Stacking and 3D CNN

In this implementation, we first train five 3D CNNs separately. Each of them reaches an accuracy of ~84-85% on the validation set. Then we combine their predictions in order to increase the overall accuracy.
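One simple way to combine the five models, shown here as a hedged sketch rather than this repository's exact code, is to average their softmax outputs and take the argmax (the `nn.Linear` stand-ins and feature size are illustrative placeholders for the trained 3D CNNs):

```python
import torch
import torch.nn as nn

# Stand-ins for 5 independently trained 3D CNNs (illustrative only)
models = [nn.Linear(64, 27) for _ in range(5)]
features = torch.randn(8, 64)  # stand-in for a batch of clip features

with torch.no_grad():
    # Average the per-model class probabilities, then pick the best class
    probs = torch.stack([m(features).softmax(dim=1) for m in models])
    averaged = probs.mean(dim=0)          # [batch, num_classes]
    prediction = averaged.argmax(dim=1)   # final label for each clip
print(prediction.shape)  # torch.Size([8])
```

Averaging probabilities is the softer cousin of the majority vote above; a stacking approach would instead train a small meta-classifier on the five models' outputs.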

stacking

And we obtain an accuracy of 88.16% on the validation set.

Important remark

Code based on this GulpIO-benchmarks implementation.

If you like this project, feel free to leave a star. (It is my only reward ^^)