This project reimplements RecNet, proposed in "Reconstruction Network for Video Captioning" [1], CVPR 2018.
$ virtualenv .env
$ source .env/bin/activate
(.env) $ pip install --upgrade pip
(.env) $ pip install -r requirements.txt
Extract Inception-v4 [2] features from each dataset and place them at <PROJECT ROOT>/<DATASET>/features/<DATASET>_InceptionV4.hdf5. I extracted the Inception-v4 features from here.
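The expected layout of that HDF5 file is, presumably, one dataset per video ID holding the per-frame features. The sketch below writes and reads dummy features in that assumed layout; the key names (`vid1`, …) are illustrative, and the 1536 feature dimension comes from Inception-v4's final pooled layer.

```python
import h5py
import numpy as np

# Hypothetical example of the assumed HDF5 layout: one dataset per video ID,
# shaped (num_frames, 1536). The key scheme is an assumption, not taken from
# the repository.
with h5py.File("MSVD_InceptionV4.hdf5", "w") as f:
    for vid, n_frames in [("vid1", 28), ("vid2", 32)]:
        f[vid] = np.random.rand(n_frames, 1536).astype(np.float32)

# Read back one video's features.
with h5py.File("MSVD_InceptionV4.hdf5", "r") as f:
    feats = f["vid1"][:]
print(feats.shape)
```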
Dataset | Inception-v4 |
---|---|
MSVD | link |
MSR-VTT | link |
Split each dataset according to the official splits by running the following:
(.env) $ python -m splits.MSVD
(.env) $ python -m splits.MSR-VTT
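Conceptually, these split modules partition the video indices according to the official splits. A minimal sketch for MSVD, which conventionally uses 1200 training, 100 validation, and 670 test clips (the ID naming below is illustrative, not the repo's actual scheme):

```python
# Hypothetical sketch of the MSVD split: the official protocol assigns the
# 1970 clips to 1200 train / 100 validation / 670 test, in order.
video_ids = [f"vid{i}" for i in range(1, 1971)]  # MSVD has 1970 clips

train_ids = video_ids[:1200]
val_ids = video_ids[1200:1300]
test_ids = video_ids[1300:]

print(len(train_ids), len(val_ids), len(test_ids))  # 1200 100 670
```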
Clone evaluation codes from the official coco-evaluation repo.
(.env) $ git clone https://github.com/tylin/coco-caption.git
(.env) $ mv coco-caption/pycocoevalcap .
(.env) $ rm -rf coco-caption
Stage 1 (Encoder-Decoder)
(.env) $ python train.py -c configs.train_stage1
Stage 2 (Encoder-Decoder-Reconstructor)
Set the pretrained_decoder_fpath of TrainConfig in configs/train_stage2.py to the checkpoint path saved at stage 1, then run:
(.env) $ python train.py -c configs.train_stage2
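For reference, the relevant part of configs/train_stage2.py would look roughly like this; the checkpoint path below is a placeholder, and any field other than pretrained_decoder_fpath is illustrative:

```python
# Hypothetical sketch of the TrainConfig edit for stage 2.
# The checkpoint filename pattern is an assumption; use whatever path
# train.py saved during stage 1.
class TrainConfig:
    pretrained_decoder_fpath = "checkpoints/stage1/best.ckpt"

print(TrainConfig.pretrained_decoder_fpath)
```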
You can change some hyperparameters by modifying configs/train_stage1.py and configs/train_stage2.py.
To evaluate a trained model, set the ckpt_fpath of RunConfig in configs/run.py to the saved checkpoint path, then run:
(.env) $ python run.py
* NOTE: As the tables below show, our RecNet does not outperform our SA-LSTM baseline; better hyperparameters remain to be found.
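The scores below are computed with the pycocoevalcap tools cloned above. To make the BLEU4 column concrete, here is a toy single-sentence BLEU-4 sketch (modified n-gram precisions for n = 1..4 plus a brevity penalty); the real evaluation is corpus-level and handled by pycocoevalcap, so this is only illustrative:

```python
from collections import Counter
import math

def bleu4(candidate, reference):
    """Toy sentence-level BLEU-4: geometric mean of modified n-gram
    precisions (n = 1..4) times a brevity penalty. Illustrative only;
    run.py uses pycocoevalcap's corpus-level implementation."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, 5):
        c_ngrams, r_ngrams = ngrams(cand, n), ngrams(ref, n)
        overlap = sum((c_ngrams & r_ngrams).values())  # clipped matches
        total = max(sum(c_ngrams.values()), 1)
        # Floor the numerator to avoid log(0) when no n-grams match.
        log_precisions.append(math.log(max(overlap, 1e-9) / total))
    brevity_penalty = min(1.0, math.exp(1 - len(ref) / max(len(cand), 1)))
    return brevity_penalty * math.exp(sum(log_precisions) / 4)

score = bleu4("a man is playing a guitar", "a man is playing a guitar")
print(round(score, 2))  # 1.0
```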
MSVD
Model | BLEU4 | CIDEr | METEOR | ROUGE_L | pretrained |
---|---|---|---|---|---|
SA-LSTM | 45.3 | 76.2 | 31.9 | 64.2 | - |
RecNet (global) | 51.1 | 79.7 | 34.0 | 69.4 | - |
RecNet (local) | 52.3 | 80.3 | 34.1 | 69.8 | - |
(Ours) SA-LSTM | 50.9 | 79.6 | 33.4 | 69.6 | link |
(Ours) RecNet (global) | 49.9 | 78.7 | 33.2 | 69.7 | link |
(Ours) RecNet (local) | 49.8 | 79.4 | 33.2 | 69.6 | link |
MSR-VTT
Model | BLEU4 | CIDEr | METEOR | ROUGE_L | pretrained |
---|---|---|---|---|---|
SA-LSTM | 36.3 | 39.9 | 25.5 | 58.3 | - |
RecNet (global) | 38.3 | 41.7 | 26.2 | 59.1 | - |
RecNet (local) | 39.1 | 42.7 | 26.6 | 59.3 | - |
(Ours) SA-LSTM | 38.0 | 40.2 | 25.6 | 58.1 | link |
(Ours) RecNet (global) | 37.4 | 40.0 | 25.5 | 58.0 | link |
(Ours) RecNet (local) | 37.9 | 40.9 | 25.7 | 58.3 | link |
[1] Wang, Bairui, et al. "Reconstruction Network for Video Captioning." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
[2] Szegedy, Christian, et al. "Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning." AAAI. 2017.