This repository contains the official source code used to produce the results reported in the following papers:
Hierarchy-based Image Embeddings for Semantic Image Retrieval.
Björn Barz and Joachim Denzler.
IEEE Winter Conference on Applications of Computer Vision (WACV), 2019.
Deep Learning on Small Datasets without Pre-Training using Cosine Loss.
Björn Barz and Joachim Denzler.
IEEE Winter Conference on Applications of Computer Vision (WACV), 2020.
If you use this code, please cite one of those papers (the first one when you work with hierarchy-based semantic embeddings, the second one when you use the cosine loss for classification).
The remainder of this ReadMe will focus on learning hierarchy-based semantic image embeddings (the first of the papers mentioned above). If you came here for more information on how we obtained the results reported in the paper about using the cosine loss for classification on small datasets (the second one), you can find those here.
Features extracted and aggregated from the last convolutional layer of deep neural networks trained for classification have proven to be useful image descriptors for a variety of tasks, e.g., transfer learning and image retrieval. Regarding content-based image retrieval, it is often claimed that visually similar images are clustered in this feature space. However, there are two major problems with this approach:
Hierarchy-based semantic embeddings overcome these issues by embedding images into a feature space where the dot product corresponds directly to semantic similarity.
To this end, the semantic similarity between classes is derived from a class taxonomy specifying is-a relationships between classes. These pair-wise similarities are then used for explicitly computing optimal target locations for all classes in the feature space. Finally, a CNN is trained to maximize the correlation between all images and the embedding of their respective class.
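As a rough illustration of how class similarities can be derived from a taxonomy, the sketch below computes them for a tiny tree. It assumes that the similarity of two classes is one minus the height of their lowest common subsumer divided by the height of the root, in line with the LCSH measure mentioned later in this ReadMe. The toy taxonomy and all function names are hypothetical and not part of this repository.

```python
import numpy as np

# Hypothetical toy taxonomy given as (parent, child) tuples.
TOY_HIERARCHY = [
    ("animal", "mammal"), ("animal", "bird"),
    ("mammal", "cat"), ("mammal", "dog"),
    ("bird", "sparrow"),
]


def build_tree(pairs):
    """Return child->parent and parent->children lookups."""
    parent = {child: par for par, child in pairs}
    children = {}
    for par, child in pairs:
        children.setdefault(par, []).append(child)
    return parent, children


def height(node, children):
    """Number of edges on the longest path from node down to a leaf."""
    kids = children.get(node, [])
    return 0 if not kids else 1 + max(height(k, children) for k in kids)


def ancestors(node, parent):
    """The node itself followed by its ancestors up to the root."""
    chain = [node]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain


def lowest_common_subsumer(u, v, parent):
    """Deepest node that is an ancestor of (or equal to) both u and v."""
    anc_u = set(ancestors(u, parent))
    return next(a for a in ancestors(v, parent) if a in anc_u)


def class_similarities(pairs):
    """Similarity matrix over the leaf classes of a tree-shaped taxonomy."""
    parent, children = build_tree(pairs)
    nodes = set(parent) | set(children)
    classes = sorted(n for n in nodes if n not in children)  # leaves only
    root = next(n for n in nodes if n not in parent)
    max_height = height(root, children)
    sim = np.array([
        [1.0 - height(lowest_common_subsumer(u, v, parent), children) / max_height
         for v in classes]
        for u in classes
    ])
    return classes, sim


classes, sim = class_similarities(TOY_HIERARCHY)
print(classes)
print(sim)
```

In this toy example, "cat" and "dog" share the parent "mammal" and therefore end up more similar to each other than either is to "sparrow", whose only common subsumer with them is the root.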
The learning process is divided into two steps:
In the following, we provide a step-by-step example for the CIFAR-100 dataset.
We derived a class taxonomy for CIFAR-100 from WordNet, but took care that our taxonomy is a tree, which is required for our method.
The hierarchy is encoded in the file Cifar-Hierarchy/cifar.parent-child.txt as a set of "parent child" tuples.
For example, the line "100 53" specifies that class 53 is a direct child of class 100.
A more human-readable version of the hierarchy can be found in Cifar-Hierarchy/hierarchy.txt and can be translated into the encoded version using Cifar-Hierarchy/encode_hierarchy.py.
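The following sketch shows how such a file of parent-child tuples could be parsed and checked for the tree property that the method requires. It is not the parser used by the scripts in this repository: the function names are hypothetical, integer labels separated by whitespace are assumed, and apart from "100 53" the example lines are made up.

```python
import io


def read_parent_child_pairs(lines):
    """Parse 'parent child' lines into a list of (parent, child) int tuples."""
    pairs = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        pairs.append((int(fields[0]), int(fields[1])))
    return pairs


def is_tree(pairs):
    """Check for a single root, one parent per node, and no cycles."""
    parent_of = {}
    for parent, child in pairs:
        if child in parent_of and parent_of[child] != parent:
            return False  # the child has two different parents
        parent_of[child] = parent
    nodes = set(parent_of) | set(parent_of.values())
    roots = nodes - set(parent_of)
    if len(roots) != 1:
        return False
    for node in nodes:
        seen = set()
        while node in parent_of:  # walk up towards the root
            if node in seen:
                return False  # we came back to a visited node: cycle
            seen.add(node)
            node = parent_of[node]
    return True


# In practice, the lines would come from open("Cifar-Hierarchy/cifar.parent-child.txt").
example_file = io.StringIO("100 53\n100 57\n101 100\n")
pairs = read_parent_child_pairs(example_file)
print(pairs)
print(is_tree(pairs))
```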
Given this set of parent-child tuples, target class embeddings can be computed as follows:
    python compute_class_embedding.py \
        --hierarchy Cifar-Hierarchy/cifar.parent-child.txt \
        --out embeddings/cifar100.unitsphere.pickle
If your hierarchy contains child-parent instead of parent-child tuples or the class labels are strings instead of integers, you can pass the arguments
By default, the number of features in the embedding space equals the number of classes. If you would like an embedding space with fewer dimensions that approximates the semantic relationships between classes, specify the desired number of feature dimensions with
--num_dim and also pass
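To give an intuition for what a lower-dimensional approximation means, here is a minimal sketch that turns a class similarity matrix into embeddings whose dot products reproduce it, optionally keeping only the leading dimensions. It uses a plain eigendecomposition, which is one standard way to do this and not necessarily the algorithm implemented in compute_class_embedding.py. The function name and the similarity matrix are hypothetical.

```python
import numpy as np


def embed_classes(sim, num_dim=None):
    """Return class embeddings (one per row) with E @ E.T approximating sim."""
    sim = np.asarray(sim, dtype=float)
    eigvals, eigvecs = np.linalg.eigh(sim)        # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]             # largest eigenvalues first
    if num_dim is not None:
        order = order[:num_dim]                   # keep the leading dimensions only
    eigvals = np.clip(eigvals[order], 0.0, None)  # guard against tiny negative values
    return eigvecs[:, order] * np.sqrt(eigvals)


# Hypothetical similarity matrix for three classes.
sim = np.array([[1.0, 0.5, 0.0],
                [0.5, 1.0, 0.0],
                [0.0, 0.0, 1.0]])

full = embed_classes(sim)
low = embed_classes(sim, num_dim=2)
print(np.abs(full @ full.T - sim).max())  # close to zero: exact reconstruction
print(np.abs(low @ low.T - sim).max())    # non-zero: approximation error
```

With as many dimensions as classes, the dot products match the similarities exactly. With fewer dimensions they are only approximated, and the embeddings are in general no longer of unit length.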
The result will be a pickle file containing a dictionary with the following items:
embedding: a numpy array whose rows are the embeddings of the classes.
ind2label: a list with the original labels of all classes, in the order corresponding to the rows of
label2ind: a dictionary mapping labels to the corresponding row index in
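A file with this layout could be read as sketched below. The helper names are hypothetical, and the toy dictionary only mimics the structure described above with made-up values, so that the example runs without the real embedding file.

```python
import os
import pickle
import tempfile

import numpy as np


def load_class_embedding(path):
    """Load the dictionary with 'embedding', 'ind2label' and 'label2ind'."""
    with open(path, "rb") as f:
        return pickle.load(f)


def embedding_for_label(data, label):
    """Look up the embedding row that belongs to an original class label."""
    return data["embedding"][data["label2ind"][label]]


# Toy stand-in for a file written by compute_class_embedding.py (values are made up).
toy = {
    "embedding": np.eye(3),
    "ind2label": [7, 3, 9],
    "label2ind": {7: 0, 3: 1, 9: 2},
}
path = os.path.join(tempfile.mkdtemp(), "toy.pickle")
with open(path, "wb") as f:
    pickle.dump(toy, f)

data = load_class_embedding(path)
print(data["embedding"].shape)       # (number of classes, number of dimensions)
print(embedding_for_label(data, 3))  # the row stored for the class labeled 3
```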
Hierarchies for the North American Birds and ILSVRC 2012 datasets can be found in NAB-Hierarchy/hierarchy.txt and ILSVRC/wordnet.parent-child.mintree.txt.
The corresponding pre-computed embeddings are stored in
For ILSVRC, three different taxonomies are provided:
After having computed the target class embeddings based on the hierarchy, we can start with training a CNN for mapping the images from the training dataset onto the embeddings of their respective classes. First, download CIFAR-100 from here and extract it to some directory. Then run:
    python learn_image_embeddings.py \
        --dataset CIFAR-100 \
        --data_root /path/to/your/cifar/directory \
        --embedding embeddings/cifar100.unitsphere.pickle \
        --architecture resnet-110-wfc \
        --cls_weight 0.1 \
        --model_dump cifar100-embedding.model.h5 \
        --feature_dump cifar100-features.pickle
This will train a variant of ResNet-110 with twice the number of channels per block for 372 epochs using Stochastic Gradient Descent with Warm Restarts (SGDR).
Thus, it is normal to see a drop of performance after epochs 12, 36, 84, and 180, where the restarts happen.
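The restart epochs quoted above are consistent with a first cycle of 12 epochs whose length doubles after every restart. The sketch below reproduces them under that assumption, which is inferred from the numbers and not read from the training script. The function names and the learning rate bounds are hypothetical.

```python
import math


def sgdr_restart_epochs(base_len, mult, total_epochs):
    """Epochs after which a warm restart happens (the final epoch is excluded)."""
    restarts, cycle_len, end = [], base_len, base_len
    while end < total_epochs:
        restarts.append(end)
        cycle_len *= mult
        end += cycle_len
    return restarts


def sgdr_learning_rate(epoch, base_len, mult, max_lr, min_lr):
    """Cosine-annealed learning rate for a 0-based epoch index."""
    cycle_len, start = base_len, 0
    while epoch >= start + cycle_len:  # find the cycle that contains this epoch
        start += cycle_len
        cycle_len *= mult
    progress = (epoch - start) / cycle_len
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


print(sgdr_restart_epochs(12, 2, 372))  # [12, 36, 84, 180]
# The learning rate jumps back to its maximum at the start of each cycle.
print(sgdr_learning_rate(11, 12, 2, max_lr=0.1, min_lr=1e-6))
print(sgdr_learning_rate(12, 12, 2, max_lr=0.1, min_lr=1e-6))
```

The jump back to a high learning rate at the start of each cycle is what causes the temporary drop in performance.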
The resulting model will be stored as
cifar100-embedding.model.h5 and pre-computed features for the test dataset will be written to the pickle file
This method trains a network with a combination of two objectives: an embedding loss and a classification loss.
For training with the embedding loss only, just omit the
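The combination of the two objectives can be pictured with the NumPy sketch below. It assumes that the embedding loss is one minus the dot product between the L2-normalized image feature and the unit-length embedding of its class, and that the classification loss is a softmax cross-entropy weighted by the value passed as --cls_weight. The exact formulation in learn_image_embeddings.py may differ, and the function name is hypothetical.

```python
import numpy as np


def combined_loss(features, logits, labels, class_embeddings, cls_weight=0.1):
    """Mean of embedding loss plus weighted cross-entropy over a batch."""
    features = np.asarray(features, dtype=float)
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels)
    # Embedding loss: 1 - <normalized feature, target class embedding>.
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    targets = np.asarray(class_embeddings, dtype=float)[labels]
    embedding_loss = 1.0 - np.sum(feats * targets, axis=1)
    # Classification loss: numerically stable softmax cross-entropy.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    cls_loss = -log_probs[np.arange(len(labels)), labels]
    return float(np.mean(embedding_loss + cls_weight * cls_loss))


# Toy batch: two classes with orthogonal unit embeddings (made-up values).
class_embeddings = np.eye(2)
features = np.array([[2.0, 0.0], [0.0, 3.0]])  # point exactly at their targets
logits = np.zeros((2, 2))                      # uninformative classifier
labels = np.array([0, 1])
print(combined_loss(features, logits, labels, class_embeddings, cls_weight=0.1))
print(combined_loss(features, logits, labels, class_embeddings, cls_weight=0.0))
```

Setting the weight to zero leaves only the embedding term, which mirrors training with the embedding loss alone.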
A set of pre-trained models can be found below.
To evaluate your learned model and features, two scripts are provided.
Several retrieval metrics for the features learned in the previous step can be computed as follows:
    python evaluate_retrieval.py \
        --dataset CIFAR-100 \
        --data_root /path/to/your/cifar/directory \
        --hierarchy Cifar-Hierarchy/cifar.parent-child.txt \
        --classes_from embeddings/cifar100.unitsphere.pickle \
        --feat cifar100-features.pickle \
        --label "Semantic Embeddings"
This provides hierarchical precision in terms of two different measures for semantic similarity between classes: Wu-Palmer Similarity (WUP) and the height of the lowest common subsumer (LCSH). We used the latter in our paper.
If you want to obtain mAHP@250, as in the paper, instead of mAHP over the entire ranking, pass --clip_ahp 250 in addition.
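To make the metric more tangible, the sketch below computes hierarchical precision for a single query. It assumes the common definition in which HP@k is the sum of the semantic similarities of the top k retrieved images divided by the largest sum any ranking could achieve at k, and it averages HP@k over all cut-offs as a simple stand-in for the area under that curve. evaluate_retrieval.py may integrate the curve differently. The function names are hypothetical, and the best possible sum is assumed to be positive at every cut-off.

```python
import numpy as np


def hierarchical_precision_curve(sims_ranked, max_k=None):
    """HP@k for k = 1..max_k, given similarities to the query in ranked order."""
    sims_ranked = np.asarray(sims_ranked, dtype=float)
    k = len(sims_ranked) if max_k is None else min(max_k, len(sims_ranked))
    best = np.sort(sims_ranked)[::-1]  # the best possible ordering
    return np.cumsum(sims_ranked[:k]) / np.cumsum(best[:k])


def average_hierarchical_precision(sims_ranked, max_k=None):
    """Mean of HP@k over all cut-offs up to max_k."""
    return float(hierarchical_precision_curve(sims_ranked, max_k).mean())


# Similarities between the query class and the classes of the retrieved
# images, in the order in which they were retrieved (made-up values).
ranking = [1.0, 0.5, 1.0, 0.0]
print(hierarchical_precision_curve(ranking))
print(average_hierarchical_precision(ranking))
print(average_hierarchical_precision(ranking, max_k=2))
```

Averaging this value over all queries yields a mean score, and the max_k argument plays the role of the clipping at 250 mentioned above.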
The classification accuracy can be evaluated as follows:
    python evaluate_classification_accuracy.py \
        --dataset CIFAR-100 \
        --data_root /path/to/your/cifar/directory \
        --hierarchy Cifar-Hierarchy/cifar.parent-child.txt \
        --classes_from embeddings/cifar100.unitsphere.pickle \
        --architecture resnet-110-wfc \
        --batch_size 100 \
        --model cifar100-embedding.model.h5 \
        --layer prob \
        --prob_features yes \
        --label "Semantic Embeddings with Classification"
If you have learned a semantic embedding model without classification objective, you can perform classification by assigning samples to the nearest class embedding as follows:
    python evaluate_classification_accuracy.py \
        --dataset CIFAR-100 \
        --data_root /path/to/your/cifar/directory \
        --hierarchy Cifar-Hierarchy/cifar.parent-child.txt \
        --classes_from embeddings/cifar100.unitsphere.pickle \
        --architecture resnet-110-wfc \
        --batch_size 100 \
        --model cifar100-embedding.model.h5 \
        --layer l2norm \
        --centroids embeddings/cifar100.unitsphere.pickle \
        --label "Semantic Embeddings"
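Conceptually, assigning samples to the nearest class embedding amounts to the few lines below. The sketch assumes unit-length class embeddings, in which case the nearest embedding is the one with the largest dot product. The function name and the toy data are hypothetical and independent of the evaluation script.

```python
import numpy as np


def classify_by_nearest_embedding(features, class_embeddings, ind2label):
    """Assign each feature vector the label of its most similar class embedding."""
    features = np.asarray(features, dtype=float)
    features = features / np.linalg.norm(features, axis=1, keepdims=True)
    scores = features @ np.asarray(class_embeddings, dtype=float).T
    return [ind2label[i] for i in scores.argmax(axis=1)]


# Made-up example with three classes embedded on the unit sphere.
class_embeddings = np.eye(3)
ind2label = ["apple", "bear", "cloud"]
features = np.array([[0.9, 0.1, 0.0],
                     [0.2, 0.1, 0.7]])
print(classify_by_nearest_embedding(features, class_embeddings, ind2label))
```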
The following values can be specified for
images. Despite the lack of any suffix, this dataset interface is equivalent to
NAB-large, just with the CUB data. That means, the input image size is 448x448.
To all datasets except CIFAR, one of the following suffixes may be appended:
-ilsvrcmean: use mean and standard deviation from the ILSVRC dataset for pre-processing.
-caffe: Caffe-style pre-processing (i.e., BGR channel ordering and no normalization of standard deviation).
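The difference between the two pre-processing styles can be sketched as follows. The function names are hypothetical, and the mean and standard deviation values in the example are placeholders, not the statistics of any real dataset.

```python
import numpy as np


def standardize(image_rgb, mean, std):
    """Subtract a channel-wise mean and divide by a channel-wise std (RGB order)."""
    image = np.asarray(image_rgb, dtype=np.float32)
    return (image - np.asarray(mean, dtype=np.float32)) / np.asarray(std, dtype=np.float32)


def caffe_style(image_rgb, mean_bgr):
    """Reorder the channels to BGR and subtract a channel-wise mean only."""
    image = np.asarray(image_rgb, dtype=np.float32)[..., ::-1]
    return image - np.asarray(mean_bgr, dtype=np.float32)


# A 2x2 RGB image in which every pixel is (10, 20, 30); placeholder statistics.
image = np.ones((2, 2, 3), dtype=np.float32) * np.array([10.0, 20.0, 30.0], dtype=np.float32)
print(standardize(image, mean=[10.0, 20.0, 30.0], std=[1.0, 2.0, 3.0])[0, 0])
print(caffe_style(image, mean_bgr=[0.0, 0.0, 0.0])[0, 0])
```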
For ILSVRC, you need to move the test images into sub-directories for each class. This script could be used for this, for example.
Custom dataset interfaces can be defined by creating a new module in the datasets package, defining a class derived from FileDatasetGenerator, importing it in datasets/__init__.py, and adding a branch for it in the get_data_generator function defined there.
--max_decay 0.1 in addition to the other arguments provided above for the ResNet. This causes a continuous decay of the learning rate so that the final learning rate is one tenth of the initial one.
resnet-110, but always with a fully-connected layer after the global average pooling, even when used for learning embeddings.
resnet-110-fc with twice the number of channels per block.
For ImageNet and NABirds:
keras-applications >= 1.0.7.
The previous sections have shown in detail how to learn semantic image embeddings for CIFAR-100. In the following, we provide the calls to learn_image_embeddings.py that we used to train our semantic embedding models (including classification objective) on the ILSVRC 2012 and NABirds datasets.
    # ILSVRC
    python learn_image_embeddings.py \
        --dataset ILSVRC \
        --data_root /path/to/imagenet/ \
        --embedding embeddings/imagenet_mintree.unitsphere.pickle \
        --architecture resnet-50 \
        --loss inv_corr \
        --cls_weight 0.1 \
        --lr_schedule SGDR \
        --sgdr_base_len 80 \
        --epochs 80 \
        --max_decay 0 \
        --batch_size 128 \
        --gpus 2 \
        --model_dump imagenet_unitsphere-embed+cls_rn50.model.h5

    # NAB (from scratch)
    python learn_image_embeddings.py \
        --dataset NAB \
        --data_root /path/to/nab/ \
        --embedding embeddings/nab.unitsphere.pickle \
        --architecture resnet-50 \
        --loss inv_corr \
        --cls_weight 0.1 \
        --lr_schedule SGDR \
        --sgdr_max_lr 0.5 \
        --max_decay 0 \
        --epochs 180 \
        --batch_size 128 \
        --gpus 2 \
        --read_workers 10 \
        --queue_size 20 \
        --model_dump nab_unitsphere-embed+cls_rn50.model.h5

    # NAB (fine-tuned)
    python learn_image_embeddings.py \
        --dataset NAB-ilsvrcmean \
        --data_root /path/to/nab/ \
        --embedding embeddings/nab.unitsphere.pickle \
        --architecture resnet-50 \
        --loss inv_corr \
        --cls_weight 0.1 \
        --finetune imagenet_unitsphere-embed+cls_rn50.model.h5 \
        --finetune_init 8 \
        --lr_schedule SGDR \
        --sgd_lr 0.1 \
        --sgdr_max_lr 0.5 \
        --max_decay 0 \
        --epochs 180 \
        --batch_size 128 \
        --gpus 2 \
        --read_workers 10 \
        --queue_size 20 \
        --model_dump nab_unitsphere-embed+cls_rn50_finetuned.model.h5
| Dataset | Model | Input Size | mAHP@250 | Balanced Accuracy |
|---------|-------|------------|----------|-------------------|
| NABirds | ResNet-50 (from scratch) | 224x224 | 73.99% | 59.46% |
| NABirds | ResNet-50 (from scratch) | 448x448 | 82.33% | 70.43% |
| CUB | ResNet-50 (from scratch) | 448x448 | 83.33% | 70.14% |
* This is an updated model with slightly better performance than reported in the paper (about 1 percentage point). The original model can be obtained here.
The pre-trained models provided above assume input images to be given in RGB color format and standardized by subtracting a dataset-specific channel-wise mean and dividing by a dataset-specific standard deviation. The means and standard deviations for each dataset are provided in the following table.
|NABirds (from scratch)||
|CUB (from scratch)||
|ILSVRC & fine-tuned models||
Sometimes, loading of the pre-trained models fails with the error message "unknown opcode".
In the case of this or other issues, you can still create the architecture yourself and load the pre-trained weights from the model files provided above.
For CIFAR-100 and the resnet-110-wfc architecture, for example, this can be done as follows:

    import keras
    import utils
    from learn_image_embeddings import cls_model

    # Build the embedding network for the 100 classes of CIFAR-100.
    model = utils.build_network(100, 'resnet-110-wfc')
    # Append an L2 normalization layer to the network output.
    model = keras.models.Model(
        model.inputs,
        keras.layers.Lambda(utils.l2norm, name = 'l2norm')(model.output)
    )
    # Wrap the model with cls_model for 100 classes, then load the pre-trained weights.
    model = cls_model(model, 100)
    model.load_weights('cifar_unitsphere-embed+cls_resnet-110-wfc.model.h5')