Implementation of "Self-Supervised Learning via Conditional Motion Propagation" (CMP)

Paper

Xiaohang Zhan, Xingang Pan, Ziwei Liu, Dahua Lin, Chen Change Loy, "Self-Supervised Learning via Conditional Motion Propagation", in CVPR 2019 [Project Page]

For further information, please contact Xiaohang Zhan.

Demos (watch the full demos on YouTube)

Data collection

• YFCC frames (45 GB)
• YFCC optical flows, computed with LiteFlowNet (29 GB)
• YFCC lists (251 MB)

Model collection

Requirements

Usage

  1. Clone the repo.

    git clone git@github.com:XiaohangZhan/conditional-motion-propagation.git
    cd conditional-motion-propagation

Representation learning

  1. Prepare data (YFCC as an example)

    mkdir data
    mkdir data/yfcc
    cd data/yfcc
    # download YFCC frames, optical flows and lists to data/yfcc
    tar -xf UnsupVideo_Frames_v1.tar.gz
    tar -xf flow_origin.tar.gz
    tar -xf lists.tar.gz

    The data folder should then look like:

    data
    └── yfcc
        ├── UnsupVideo_Frames
        ├── flow_origin
        └── lists
            ├── train.txt
            └── val.txt
  2. Train CMP for Representation Learning.

    • If your server supports multi-node training:

      sh experiments/rep_learning/alexnet_yfcc_voc_16gpu_70k/train.sh # 16-GPU distributed training
      python tools/weight_process.py --config experiments/rep_learning/alexnet_yfcc_voc_16gpu_70k/config.yaml --iter 70000 # extract the image encoder weights to experiments/rep_learning/alexnet_yfcc_voc_16gpu_70k/checkpoints/convert_iter_70000.pth.tar

    • If your server does not support multi-node training:

      sh experiments/rep_learning/alexnet_yfcc_voc_8gpu_140k/train.sh # 8-GPU distributed training
      python tools/weight_process.py --config experiments/rep_learning/alexnet_yfcc_voc_8gpu_140k/config.yaml --iter 140000 # extract the image encoder weights
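
After the conversion step, the extracted image-encoder weights can be loaded into a downstream backbone. The following is a minimal sketch, assuming the converted file holds a plain PyTorch state_dict (possibly wrapped under a 'state_dict' key) whose layer names roughly match a torchvision AlexNet; the exact key names depend on tools/weight_process.py, so strict=False is used and mismatches are printed for inspection.

    # Sketch only: load the converted encoder weights into a torchvision AlexNet.
    # Key names depend on tools/weight_process.py, hence strict=False.
    import torch
    import torchvision.models as models

    ckpt_path = ('experiments/rep_learning/alexnet_yfcc_voc_16gpu_70k/'
                 'checkpoints/convert_iter_70000.pth.tar')
    checkpoint = torch.load(ckpt_path, map_location='cpu')
    state_dict = checkpoint.get('state_dict', checkpoint)  # unwrap if nested

    model = models.alexnet()
    result = model.load_state_dict(state_dict, strict=False)
    print(result)  # lists missing / unexpected keys on recent PyTorch versions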

Run demos

  1. Download the model and move it to experiments/semiauto_annot/resnet50_vip+mpii_liteflow/checkpoints/.

  2. Launch Jupyter Notebook and run demos/cmp.ipynb for conditional motion propagation, or demos/demo_annot.ipynb for semi-automatic annotation. (A conceptual sketch of the guidance format used by the demo follows this list.)

  3. Train the model yourself (optional)

    # data not ready
    sh experiments/semiauto_annot/resnet50_vip+mpii_liteflow/train.sh # 8 GPUs distributed training
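
Conceptually, the cmp.ipynb demo feeds the network a single image together with a few user-drawn arrows, encoded as a sparse two-channel flow map plus a one-channel validity mask, and the network propagates that guidance into dense flow. The sketch below only illustrates how such guidance tensors can be assembled; load_cmp_model and preprocess are hypothetical placeholders rather than the repository's actual API (the real model construction lives in demos/cmp.ipynb).

    # Hypothetical sketch of the demo's input format: image + sparse motion
    # guidance -> dense flow. Model loading/preprocessing are placeholders.
    import numpy as np
    import torch

    def make_sparse_guidance(height, width, arrows):
        """Encode user arrows as a 2-channel sparse flow map and a 1-channel mask.

        arrows: iterable of (x, y, dx, dy), one entry per guidance point.
        """
        sparse_flow = np.zeros((2, height, width), dtype=np.float32)
        mask = np.zeros((1, height, width), dtype=np.float32)
        for x, y, dx, dy in arrows:
            sparse_flow[0, y, x] = dx
            sparse_flow[1, y, x] = dy
            mask[0, y, x] = 1.0
        return torch.from_numpy(sparse_flow), torch.from_numpy(mask)

    # Example: drag a point near the image center 15 pixels to the right.
    guidance, valid = make_sparse_guidance(320, 320, [(160, 160, 15.0, 0.0)])

    # model = load_cmp_model(...)      # placeholder, see demos/cmp.ipynb
    # image = preprocess(...)          # 1 x 3 x H x W tensor
    # dense_flow = model(image, guidance.unsqueeze(0), valid.unsqueeze(0))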

Results

1. Pascal VOC 2012 Semantic Segmentation (AlexNet)

<table class="table table-condensed">
    <tr><th>Method (AlexNet)</th><th>Supervision (data amount)</th><th>% mIoU</th></tr>
    <tbody>
    <tr><td>Krizhevsky et al. [1]</td><td>ImageNet labels (1.3M)</td><td>48.0</td></tr>
    <tr><td>Random</td><td>- (0)</td><td>19.8</td></tr>
    <tr><td>Pathak et al. [2]</td><td>In-painting (1.2M)</td><td>29.7</td></tr>
    <tr><td>Zhang et al. [3]</td><td>Colorization (1.3M)</td><td>35.6</td></tr>
    <tr><td>Zhang et al. [4]</td><td>Split-Brain (1.3M)</td><td>36.0</td></tr>
    <tr><td>Noroozi et al. [5]</td><td>Counting (1.3M)</td><td>36.6</td></tr>
    <tr><td>Noroozi et al. [6]</td><td>Jigsaw (1.3M)</td><td>37.6</td></tr>
    <tr><td>Noroozi et al. [7]</td><td>Jigsaw++ (1.3M)</td><td>38.1</td></tr>
    <tr><td>Jenni et al. [8]</td><td>Spot-Artifacts (1.3M)</td><td>38.1</td></tr>
    <tr><td>Larsson et al. [9]</td><td>Colorization (3.7M)</td><td>38.4</td></tr>
    <tr><td>Gidaris et al. [10]</td><td>Rotation (1.3M)</td><td>39.1</td></tr>
    <tr><td>Pathak et al. [11]*</td><td>Motion Segmentation (1.6M)</td><td>39.7</td></tr>
    <tr><td>Walker et al. [12]*</td><td>Flow Prediction (3.22M)</td><td>40.4</td></tr>
    <tr><td>Mundhenk et al. [13]</td><td>Context (1.3M)</td><td>40.6</td></tr>
    <tr><td>Mahendran et al. [14]</td><td>Flow Similarity (1.6M)</td><td>41.4</td></tr>
    <tr><td>Ours</td><td>CMP (1.26M)</td><td>42.9</td></tr>
    <tr><td>Ours</td><td>CMP (3.22M)</td><td>44.5</td></tr>
    <tr><td>Caron et al. [15]</td><td>Clustering (1.3M)</td><td>45.1</td></tr>
    <tr><td>Feng et al. [16]</td><td>Feature Decoupling (1.3M)</td><td>45.3</td></tr>
    </tbody>
</table>

2. Pascal VOC 2012 Semantic Segmentation (ResNet-50)

<table class="table table-condensed">
    <tr><th>Method (ResNet-50)</th><th>Supervision (data amount)</th><th>% mIoU</th></tr>
    <tbody>
    <tr><td>Krizhevsky et al. [1]</td><td>ImageNet labels (1.2M)</td><td>69.0</td></tr>
    <tr><td>Random</td><td>- (0)</td><td>42.4</td></tr>
    <tr><td>Walker et al. [12]*</td><td>Flow Prediction (1.26M)</td><td>54.5</td></tr>
    <tr><td>Pathak et al. [11]*</td><td>Motion Segmentation (1.6M)</td><td>54.6</td></tr>
    <tr><td>Ours</td><td>CMP (1.26M)</td><td>59.0</td></tr>
    </tbody>
</table>

3. COCO 2017 Instance Segmentation (ResNet-50)

<table class="table table-condensed">
    <tr><th>Method (ResNet-50)</th><th>Supervision (data amount)</th><th>Det. (% mAP)</th><th>Seg. (% mAP)</th></tr>
    <tbody>
    <tr><td>Krizhevsky et al. [1]</td><td>ImageNet labels (1.2M)</td><td>37.2</td><td>34.1</td></tr>
    <tr><td>Random</td><td>- (0)</td><td>19.7</td><td>18.8</td></tr>
    <tr><td>Pathak et al. [11]*</td><td>Motion Segmentation (1.6M)</td><td>27.7</td><td>25.8</td></tr>
    <tr><td>Walker et al. [12]*</td><td>Flow Prediction (1.26M)</td><td>31.5</td><td>29.2</td></tr>
    <tr><td>Ours</td><td>CMP (1.26M)</td><td>32.3</td><td>29.8</td></tr>
    </tbody>
</table>

4. LIP Human Parsing (ResNet-50)

<table class="table table-condensed">
    <tr><th>Method (ResNet-50)</th><th>Supervision (data amount)</th><th>Single-Person (% mIoU)</th><th>Multi-Person (% mIoU)</th></tr>
    <tbody>
    <tr><td>Krizhevsky et al. [1]</td><td>ImageNet labels (1.2M)</td><td>42.5</td><td>55.4</td></tr>
    <tr><td>Random</td><td>- (0)</td><td>32.5</td><td>35.0</td></tr>
    <tr><td>Pathak et al. [11]*</td><td>Motion Segmentation (1.6M)</td><td>36.6</td><td>50.9</td></tr>
    <tr><td>Walker et al. [12]*</td><td>Flow Prediction (1.26M)</td><td>36.7</td><td>52.5</td></tr>
    <tr><td>Ours</td><td>CMP (1.26M)</td><td>36.9</td><td>51.8</td></tr>
    <tr><td>Ours</td><td>CMP (4.57M)</td><td>40.2</td><td>52.9</td></tr>
    </tbody>
</table>
Note: methods marked with * did not report these results in their papers; we reimplemented those methods to obtain the numbers above.

References

<ol>
    <li>Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In NeurIPS, 2012.</li>
    <li>Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016.</li>
    <li>Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In ECCV. Springer, 2016.</li>
    <li>Richard Zhang, Phillip Isola, and Alexei A Efros. Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In CVPR, 2017.</li>
    <li>Mehdi Noroozi, Hamed Pirsiavash, and Paolo Favaro. Representation learning by learning to count. In ICCV, 2017.</li>
    <li>Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV. Springer, 2016.</li>
    <li>Mehdi Noroozi, Ananth Vinjimoor, Paolo Favaro, and Hamed Pirsiavash. Boosting self-supervised learning via knowledge transfer. In CVPR, 2018.</li>
    <li>Simon Jenni and Paolo Favaro. Self-supervised feature learning by learning to spot artifacts. In CVPR, 2018.</li>
    <li>Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Colorization as a proxy task for visual understanding. In CVPR, 2017.</li>
    <li>Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. In ICLR, 2018.</li>
    <li>Deepak Pathak, Ross B Girshick, Piotr Dollar, Trevor Darrell, and Bharath Hariharan. Learning features by watching objects move. In CVPR, 2017.</li>
    <li>Jacob Walker, Abhinav Gupta, and Martial Hebert. Dense optical flow prediction from a static image. In ICCV, 2015.</li>
    <li>T Nathan Mundhenk, Daniel Ho, and Barry Y Chen. Improvements to context based self-supervised learning. In CVPR, 2018.</li>
    <li>Aravindh Mahendran, James Thewlis, and Andrea Vedaldi. Cross pixel optical flow similarity for self-supervised learning. In ACCV, 2018.</li>
    <li>Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In ECCV, 2018.</li>
    <li>Zeyu Feng, Chang Xu, and Dacheng Tao. Self-supervised representation learning by rotation feature decoupling. In CVPR, 2019.</li>
</ol>

Core idea

A Chinese proverb: "牵一发而动全身" ("pull one hair and the whole body moves").
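
This is exactly the CMP pretext task: because object motion is highly structured, the flow at a few sampled pixels constrains the flow everywhere else, so the network learns to recover dense optical flow from a static image plus sparse motion guidance. The toy sketch below illustrates that objective only; the model interface, guidance sampling, and loss are hypothetical simplifications of what the experiment configs actually define (the paper samples guidance points more carefully and predicts quantized flow).

    # Toy illustration of the CMP pretext objective (interface is hypothetical):
    # keep flow at a few guidance pixels, train the network to recover it densely.
    import torch
    import torch.nn.functional as F

    def sample_sparse_guidance(flow, num_points=10):
        """Zero out a dense flow field except at `num_points` random pixels."""
        b, _, h, w = flow.shape
        mask = torch.zeros(b, 1, h, w, device=flow.device)
        for i in range(b):
            ys = torch.randint(0, h, (num_points,), device=flow.device)
            xs = torch.randint(0, w, (num_points,), device=flow.device)
            mask[i, 0, ys, xs] = 1.0
        return flow * mask, mask

    def cmp_step(model, image, target_flow):
        sparse_flow, mask = sample_sparse_guidance(target_flow)
        pred_flow = model(image, sparse_flow, mask)   # hypothetical signature
        return F.mse_loss(pred_flow, target_flow)     # simplified regression loss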

Bibtex

@inproceedings{zhan2019self,
 author = {Zhan, Xiaohang and Pan, Xingang and Liu, Ziwei and Lin, Dahua and Loy, Chen Change},
 title = {Self-Supervised Learning via Conditional Motion Propagation},
 booktitle = {Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR)},
 month = {June},
 year = {2019}
}