This is a PyTorch implementation of a two-stream network for action classification on the Kinetics dataset. We train the two streams independently, on individual (or stacked) frames of RGB (appearance) and optical flow (flow) as inputs.

The objective of this repository is to establish a two-stream baseline and to ease the training process on such a huge dataset.
ffmpeg
```bash
# First install the Python server and client
pip install visdom
# Start the server (probably in a screen or tmux)
python -m visdom.server --port=8097
```
The Kinetics dataset can be downloaded using the Crawler.
Notes:

First, we need to extract images from the videos using ffmpeg and resave the annotations so that they are compatible with this code. The scripts in the prep folder of this repo can help with both tasks.
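As a rough sketch of the frame-extraction step (the output filename pattern and the fixed frame rate are assumptions, not necessarily the settings used by the prep scripts):

```python
import os
import subprocess

def build_ffmpeg_cmd(video_path, out_dir, fps=25):
    """Return the ffmpeg argument list for dumping a video's frames as JPEGs.

    The fps and the zero-padded filename pattern are illustrative; adjust
    them to whatever the annotation/prep scripts expect.
    """
    return [
        "ffmpeg", "-i", video_path,
        "-vf", "fps=%d" % fps,              # resample to a fixed frame rate
        os.path.join(out_dir, "%05d.jpg"),  # zero-padded frame filenames
    ]

def extract_frames(video_path, out_dir, fps=25):
    os.makedirs(out_dir, exist_ok=True)
    subprocess.run(build_ffmpeg_cmd(video_path, out_dir, fps), check=True)
```

Running this over the whole dataset is embarrassingly parallel, so it can be driven by a process pool, one video per task.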
You also need to compute optical flow images using optical-flow. Compute Farneback flow, as it is much faster to compute and gives reasonable results. You might want to run multiple processes in parallel.
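Flow fields are usually stored as ordinary 8-bit images by clipping each flow component to a fixed range and quantizing it. The sketch below shows that mapping; the clipping bound of 20 pixels is an assumption for illustration, not necessarily what this repo uses:

```python
def flow_to_uint8(value, bound=20.0):
    """Map one flow component from [-bound, bound] to an integer in [0, 255].

    Values outside the range are clipped, and zero flow lands in the
    mid-grey region, so flow images remain viewable and invertible up to
    quantization error. bound=20 pixels is an illustrative assumption.
    """
    clipped = max(-bound, min(bound, value))
    return int(round((clipped + bound) * 255.0 / (2.0 * bound)))

def uint8_to_flow(pixel, bound=20.0):
    """Approximate inverse of flow_to_uint8."""
    return pixel * (2.0 * bound) / 255.0 - bound
```

In practice this mapping is applied element-wise to the two flow channels (x and y) before saving them as grayscale images.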
Set global_models_dir in train.py. Simply specify the parameters listed in train.py as flags, or change them manually. Let's assume that you extracted the dataset into the /home/user/kinetics/ directory; then your train command from the root directory of this repo is going to be:

```bash
CUDA_VISIBLE_DEVICES=0 python train.py --root=/home/user/kinetics/ --global_models_dir=/home/user/pretrained-models/ --visdom=True --input_type=rgb --stepvalues=200000,350000 --max_iterations=500000
```
To train on flow inputs:

```bash
CUDA_VISIBLE_DEVICES=1 python train.py --root=/home/user/kinetics/ --global_models_dir=/home/user/pretrained-models/ --visdom=True --input_type=farneback --stepvalues=250000,400000 --max_iterations=500000
```
Different parameter settings in train.py will result in different performance.

We report top-1 & top-3 accuracies on a subset of 95k validation images. You can use test.py to generate frame-level scores and save video-level results in a JSON file, then use eval.py to evaluate the results on the validation set.
Once you have a trained network, you can use test.py to generate frame-level scores. Simply specify the parameters listed in test.py as flags, or change them manually, e.g.:

```bash
CUDA_VISIBLE_DEVICES=0 python3 test.py --root=/home/user/kinetics/ --input=rgb --test-iteration=500000
```
Note: we report top-1 & top-3 accuracies using the model from the 60K-th iteration. Video-level labeling requires frame-level scores. test.py not only stores frame-level scores but also video-level scores, in its evaluate function. It dumps the video-level output in JSON format (the same format used in the ActivityNet challenge) for the validation set.
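The aggregation from frame-level to video-level scores can be pictured roughly like this; the helper below is a simplified sketch (not the repo's actual evaluate function), and the field names mirror the ActivityNet-style submission layout:

```python
def video_level_results(frame_scores, class_names, top_k=5):
    """Average frame-level scores into video-level scores and keep the
    top-k classes per video, in an ActivityNet-style results dict.

    frame_scores: dict mapping video_id -> list of per-frame score lists.
    class_names:  list mapping class index -> human-readable label.
    """
    results = {}
    for video_id, frames in frame_scores.items():
        num_classes = len(frames[0])
        # Mean score per class over all frames of the video.
        mean = [sum(f[c] for f in frames) / len(frames) for c in range(num_classes)]
        ranked = sorted(range(num_classes), key=lambda c: mean[c], reverse=True)
        results[video_id] = [
            {"label": class_names[c], "score": mean[c]} for c in ranked[:top_k]
        ]
    return {"results": results}
```

The returned dict can then be serialized with json.dump to produce the validation-set output file.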
Now you can specify the parameters in eval.py and evaluate the video-level results.
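The top-1 / top-k accuracy numbers reported below can be computed with a small helper like this (a generic sketch, not eval.py's actual code):

```python
def topk_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k highest scores.

    scores: list of per-class score lists, one per sample.
    labels: list of ground-truth class indices, one per sample.
    """
    correct = 0
    for row, label in zip(scores, labels):
        # Indices of the k highest-scoring classes for this sample.
        topk = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        correct += label in topk
    return correct / len(labels)
```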
The table below records the performance of a resnet101 model on the Mini-Kinetics dataset. It is trained for 60K iterations with a learning rate of 0.0005, dropped by a factor of 10 after 25000, 40000, and 55000 iterations. The batch size used is 64.
| method | frame-top1 | frame-top3 | video-top1 | video-top5 | video-AVG | video-mAP |
|--------|-----------|-----------|-----------|-----------|-----------|-----------|
| Resnet101-RGB | 61.5 | 77.9 | 75.7 | 92.2 | 83.9 | 78.1 |
Pre-trained models can be downloaded from the links given below.
You will need to make changes in test.py
to accept the downloaded weights.
- rgb_OneFrame_model_500000
- farneback_OneFrame_model_500000