This is a PyTorch implementation of a two-stream network for action classification on the Kinetics dataset. We train the two streams independently, on individual (or stacked) frames of RGB (appearance) and optical flow (flow) as inputs.

The objective of this repository is to establish a two-stream baseline and to ease the training process on such a huge dataset.
ffmpeg
```bash
# First install the Python server and client
pip install visdom
# Start the server (probably in a screen or tmux)
python -m visdom.server --port=8097
```
The Kinetics dataset can be downloaded using the Crawler.
Notes:

First, we need to extract images from the videos using ffmpeg and resave the annotations so that they are compatible with this code. The scripts in the prep folder of this repo can help with both tasks.
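As a rough sketch of the frame-extraction step (the output filename pattern and the fixed frame rate are assumptions, not necessarily the settings used by the prep scripts):

```python
import os
import subprocess

def build_ffmpeg_cmd(video_path, out_dir, fps=25):
    """Return the ffmpeg argument list for dumping a video's frames as JPEGs.

    The fps and the zero-padded filename pattern are illustrative; adjust
    them to whatever the annotation/prep scripts expect.
    """
    return [
        "ffmpeg", "-i", video_path,
        "-vf", "fps=%d" % fps,              # resample to a fixed frame rate
        os.path.join(out_dir, "%05d.jpg"),  # zero-padded frame filenames
    ]

def extract_frames(video_path, out_dir, fps=25):
    os.makedirs(out_dir, exist_ok=True)
    subprocess.run(build_ffmpeg_cmd(video_path, out_dir, fps), check=True)
```

Running this over the whole dataset is embarrassingly parallel, so it can be driven by a process pool, one video per task.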
You also need to compute optical flow images using optical-flow. Compute Farneback flow, as it is much faster to compute and gives reasonable results. You might want to run multiple processes in parallel.
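Flow fields are usually stored as ordinary 8-bit images by clipping each flow component to a fixed range and quantizing it. The sketch below shows that mapping; the clipping bound of 20 pixels is an assumption for illustration, not necessarily what this repo uses:

```python
def flow_to_uint8(value, bound=20.0):
    """Map one flow component from [-bound, bound] to an integer in [0, 255].

    Values outside the range are clipped, and zero flow lands in the
    mid-grey region, so flow images remain viewable and invertible up to
    quantization error. bound=20 pixels is an illustrative assumption.
    """
    clipped = max(-bound, min(bound, value))
    return int(round((clipped + bound) * 255.0 / (2.0 * bound)))

def uint8_to_flow(pixel, bound=20.0):
    """Approximate inverse of flow_to_uint8."""
    return pixel * (2.0 * bound) / 255.0 - bound
```

In practice this mapping is applied element-wise to the two flow channels (x and y) before saving them as grayscale images.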
Set global_models_dir in train.py. Simply specify the parameters listed in train.py as flags, or change them manually. Let's assume that you extracted the dataset into the /home/user/kinetics/ directory; then your train command from the root directory of this repo is going to be:

```bash
CUDA_VISIBLE_DEVICES=0 python train.py --root=/home/user/kinetics/ --global_models_dir=/home/user/pretrained-models/ --visdom=True --input_type=rgb --stepvalues=200000,350000 --max_iterations=500000
```
To train on flow inputs:

```bash
CUDA_VISIBLE_DEVICES=1 python train.py --root=/home/user/kinetics/ --global_models_dir=/home/user/pretrained-models/ --visdom=True --input_type=farneback --stepvalues=250000,400000 --max_iterations=500000
```
Different parameter settings in train.py will result in different performance.

We report top-1 & top-3 accuracies on a subset of 95k validation images. You can use test.py to generate frame-level scores and save video-level results in a JSON file, then use eval.py to evaluate the results on the validation set.
Once you have a trained network, you can use test.py to generate frame-level scores. Simply specify the parameters listed in test.py as flags, or change them manually, e.g.:

```bash
CUDA_VISIBLE_DEVICES=0 python3 test.py --root=/home/user/kinetics/ --input=rgb --test-iteration=500000
```
Note: we report top-1 & top-3 accuracies using the model from the 60K-th iteration. Video-level labeling requires frame-level scores. test.py not only stores frame-level scores but also video-level scores, in its evaluate function. It dumps the video-level output in JSON format (the same format used in the ActivityNet challenge) for the validation set.
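The aggregation from frame-level to video-level scores can be pictured roughly like this; the helper below is a simplified sketch (not the repo's actual evaluate function), and the field names mirror the ActivityNet-style submission layout:

```python
def video_level_results(frame_scores, class_names, top_k=5):
    """Average frame-level scores into video-level scores and keep the
    top-k classes per video, in an ActivityNet-style results dict.

    frame_scores: dict mapping video_id -> list of per-frame score lists.
    class_names:  list mapping class index -> human-readable label.
    """
    results = {}
    for video_id, frames in frame_scores.items():
        num_classes = len(frames[0])
        # Mean score per class over all frames of the video.
        mean = [sum(f[c] for f in frames) / len(frames) for c in range(num_classes)]
        ranked = sorted(range(num_classes), key=lambda c: mean[c], reverse=True)
        results[video_id] = [
            {"label": class_names[c], "score": mean[c]} for c in ranked[:top_k]
        ]
    return {"results": results}
```

The returned dict can then be serialized with json.dump to produce the validation-set output file.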
Now you can specify the parameters in eval.py and evaluate the video-level results.
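The top-1 / top-k accuracy numbers reported below can be computed with a small helper like this (a generic sketch, not eval.py's actual code):

```python
def topk_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k highest scores.

    scores: list of per-class score lists, one per sample.
    labels: list of ground-truth class indices, one per sample.
    """
    correct = 0
    for row, label in zip(scores, labels):
        # Indices of the k highest-scoring classes for this sample.
        topk = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        correct += label in topk
    return correct / len(labels)
```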
The table below records the performance of a resnet101 model on the Mini-Kinetics dataset. It is trained for 60K iterations with a learning rate of 0.0005, dropped by a factor of 10 after 25000, 40000, and 55000 iterations. The batch size used is 64.
| method | frame-top1 | frame-top3 | video-top1 | video-top5 | video-AVG | video-mAP |
|--------|-----------|-----------|-----------|-----------|-----------|-----------|
| Resnet101-RGB | 61.5 | 77.9 | 75.7 | 92.2 | 83.9 | 78.1 |
Pre-trained models can be downloaded from the links given below.
You will need to make changes in test.py
to accept the downloaded weights.
- rgb_OneFrame_model_500000
- farneback_OneFrame_model_500000