Semi-supervised detection of wrong labels in labeled data sets



This algorithm is well suited to validating labelled images obtained from web scraping, untrusted sources or collaboratively generated labels.

How does it work

This approach separates the images of a directory into two classes, inliers and outliers. Each image is described by its bottleneck values, i.e. the values generated at the end of the convolution phase of a pre-trained CNN such as 'inception-v3', or a MobileNets model for faster computation. (see: Here)

These values are then fed to a clustering algorithm to get a prediction. To increase performance, the bottleneck values of some random images are precomputed and added when fitting our classifier.
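The idea above can be illustrated with a minimal sketch. It is not the project's code: the random vectors stand in for real bottleneck values, `two_means` and `detect_outliers` are hypothetical helpers, and the rule "the cluster holding most pollution vectors is the outlier cluster" is an assumption made for this example only.

```python
import numpy as np

def two_means(X, n_iter=20):
    """Tiny 2-cluster k-means; returns a 0/1 label for each row of X."""
    # Deterministic init: the first row and the row farthest from it.
    c0 = X[0]
    c1 = X[np.argmax(np.linalg.norm(X - c0, axis=1))]
    centers = np.stack([c0, c1])
    for _ in range(n_iter):
        # Distance of every row to both centers, then nearest-center labels.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for k in range(2):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels

def detect_outliers(bottlenecks, pollution):
    """Flag rows of `bottlenecks` that fall in the same cluster as `pollution`."""
    X = np.vstack([bottlenecks, pollution])
    labels = two_means(X)
    n = len(bottlenecks)
    # Assumption: the cluster holding most pollution vectors is the outlier one.
    outlier_cluster = np.bincount(labels[n:], minlength=2).argmax()
    return labels[:n] == outlier_cluster

# Synthetic stand-ins for bottleneck values (not real CNN outputs).
rng = np.random.default_rng(0)
inliers = rng.normal(0.0, 0.5, size=(40, 8))    # correctly labelled images
wrong = rng.normal(5.0, 0.5, size=(10, 8))      # wrongly labelled images
pollution = rng.normal(5.0, 0.5, size=(20, 8))  # precomputed random-image values

mask = detect_outliers(np.vstack([inliers, wrong]), pollution)
print("flagged as outliers:", int(mask.sum()))
```

In this toy setting the pollution vectors simply give the outlier cluster more members to anchor on, which is one plausible reading of why adding them during the fit helps.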

Basic Usage

Just run the script and pass the location of the directory you want to check, like so:

python --image_dir=./foo/LocationLabelDir/

A GUI will then pop up, where you can fine-tune the detection and delete/move the selected outliers.


$ pip install -r requirements.txt


Creating your own noisy bottlenecks

You can create your own pollution values, more specific to your problem, by running the script. See further explanations and options inside the file.

python --image_dir=./foo/LocationLabelDir/ --architecture=all

Some results


These graphs are generated using a set of cat images (inliers) and a gradually increasing percentage of dog images.

As expected, this method works really well for images or labels that our pre-trained CNN has seen. So if you have a specific need, or are checking a constant data stream with a known set of expected images, I encourage you to use a custom CNN and adapt the code, and/or use the create_noise_bottlenecks script, for better performance.

Furthermore, from these graphs, the breaking point of our estimator can be estimated at between 40 % and 45 % of noise. This is due to the process of normalizing the predictions and choosing the smallest cluster as inliers.



In these two graphs, the method has been tested on a likely data mining source, Google Images. It achieves a detection of up to 90 % of wrong labels, while keeping false positives under 10 % of our set. This can bring your data set from unsalvageable, or too expensive to correct or gather, to something that could be used for training despite residual noise (see: Here).
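To make the two figures quoted above concrete, here is a small sketch of how they could be computed from a ground-truth list and a list of flags. The function name `detection_metrics` and the toy numbers are hypothetical, and the false positive figure is measured against the whole set, as in the sentence above.

```python
def detection_metrics(is_wrong, flagged):
    """Return (share of wrong labels detected, false positives as a share of the set)."""
    total = len(is_wrong)
    n_wrong = sum(is_wrong)
    # Wrong labels that were flagged.
    true_pos = sum(1 for w, f in zip(is_wrong, flagged) if w and f)
    # Correct labels that were flagged by mistake.
    false_pos = sum(1 for w, f in zip(is_wrong, flagged) if f and not w)
    detection_rate = true_pos / n_wrong if n_wrong else 0.0
    false_positive_share = false_pos / total if total else 0.0
    return detection_rate, false_positive_share

# Toy example: 100 images, 20 of them wrongly labelled.
is_wrong = [True] * 20 + [False] * 80
# The detector flags 18 of the wrong labels and 5 of the correct ones.
flagged = [True] * 18 + [False] * 2 + [True] * 5 + [False] * 75

rate, fp_share = detection_metrics(is_wrong, flagged)
print(f"detected {rate:.0%} of wrong labels, false positives: {fp_share:.0%} of the set")
```

With these made-up numbers, 18 of 20 wrong labels are found and 5 of 100 images are flagged by mistake.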


Here is a visualisation of this problem after different Isomap transformations. It helps validate the process by showing the separability of our two clusters, and gives an intuition of the effect of the pollution images that are added for better performance.


What could be improved

As of right now, the unsupervised algorithms used are still limited, as they only cluster the data instead of being truly semi-supervised. They could also reuse information from the user who deleted/moved wrong labels with the GUI.