Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles
- SSL

In this paper we study the problem of image representation learning without human annotation. Following the principles of self-supervision, we build a convolutional neural network (CNN) that can be trained to solve Jigsaw puzzles as a pretext task, which requires no manual labeling, and is then repurposed to solve object classification and detection. To maintain compatibility across tasks we introduce the context-free network (CFN), a Siamese-ennead CNN. The CFN takes image tiles as input and explicitly limits the receptive field (or context) of its early processing units to one tile at a time. We show that the CFN is a more compact version of AlexNet, but with the same semantic learning capabilities. By training the CFN to solve Jigsaw puzzles, we learn both a feature mapping of object parts and their correct spatial arrangement. Our experimental evaluations show that the learned features capture semantically relevant content. In object detection, features extracted from the CFN achieve the highest performance (51.8%) among features trained without supervision, close to that of features trained with supervision (56.5%). In object classification on the ImageNet 2012 dataset, the CFN features also achieve the best accuracy (38.1%) among features trained without supervision.
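The pretext task can be sketched as follows: split each image into a 3×3 grid of tiles, shuffle them according to one permutation drawn from a fixed set, and train the network to classify which permutation was applied. The sketch below illustrates the data side of this setup; it is a simplified assumption-laden illustration (the paper selects the permutation set by maximizing Hamming distance between permutations, whereas here the set is sampled at random, and `make_tiles`, `permutation_set`, and `jigsaw_example` are hypothetical helper names, not the authors' code):

```python
import random
import numpy as np

def make_tiles(image, grid=3):
    """Split a square image array (H, W, ...) into grid*grid equal tiles,
    ordered row by row."""
    h = image.shape[0] // grid
    w = image.shape[1] // grid
    return [image[r*h:(r+1)*h, c*w:(c+1)*w]
            for r in range(grid) for c in range(grid)]

def permutation_set(n_tiles=9, n_perms=64, seed=0):
    """Sample a fixed set of distinct tile permutations; the index of a
    permutation in this set serves as the classification label.
    (Random sampling is a simplification of the paper's Hamming-distance
    based selection.)"""
    rng = random.Random(seed)
    perms = set()
    while len(perms) < n_perms:
        p = list(range(n_tiles))
        rng.shuffle(p)
        perms.add(tuple(p))
    return sorted(perms)

def jigsaw_example(image, perms, rng):
    """Produce one training example: the shuffled tiles plus the index of
    the permutation used, which the CFN is trained to predict."""
    label = rng.randrange(len(perms))
    tiles = make_tiles(image)
    shuffled = [tiles[i] for i in perms[label]]
    return shuffled, label
```

Because each tile is fed to one branch of the Siamese-ennead network, the early layers never see context beyond a single tile; only the shared fully connected layers reason about the tiles' relative arrangement.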