Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles
- SSL

In this paper we study the problem of image representation learning without human annotation. Following the principles of self-supervision, we build a convolutional neural network (CNN) that can be trained to solve Jigsaw puzzles as a pretext task, which requires no manual labeling, and is then repurposed to solve object classification and detection. To maintain compatibility across tasks we introduce the context-free network (CFN), a Siamese-ennead CNN. The CFN takes image tiles as input and explicitly limits the receptive field (or context) of its early processing units to one tile at a time. We show that the CFN is a more compact version of AlexNet, but with the same semantic learning capabilities. By training the CFN to solve Jigsaw puzzles, we learn both a feature mapping of object parts and their correct spatial arrangement. Our experimental evaluations show that the learned features capture semantically relevant content. In object detection, features extracted from the CFN achieve the highest performance (51.8%) among features trained without supervision, close to that of features trained with supervision (56.5%). In object classification on the ImageNet 2012 dataset, the CFN features also achieve the best accuracy (38.1%) among features trained without supervision.
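The pretext task can be sketched as follows: split each image into a 3×3 grid of tiles, shuffle them according to one permutation drawn from a fixed set, and train the network to classify which permutation was applied. The sketch below illustrates the data side of this setup; it is a simplified assumption-laden illustration (the paper selects the permutation set by maximizing Hamming distance between permutations, whereas here the set is sampled at random, and `make_tiles`, `permutation_set`, and `jigsaw_example` are hypothetical helper names, not the authors' code):

```python
import random
import numpy as np

def make_tiles(image, grid=3):
    """Split a square image array (H, W, ...) into grid*grid equal tiles,
    ordered row by row."""
    h = image.shape[0] // grid
    w = image.shape[1] // grid
    return [image[r*h:(r+1)*h, c*w:(c+1)*w]
            for r in range(grid) for c in range(grid)]

def permutation_set(n_tiles=9, n_perms=64, seed=0):
    """Sample a fixed set of distinct tile permutations; the index of a
    permutation in this set serves as the classification label.
    (Random sampling is a simplification of the paper's Hamming-distance
    based selection.)"""
    rng = random.Random(seed)
    perms = set()
    while len(perms) < n_perms:
        p = list(range(n_tiles))
        rng.shuffle(p)
        perms.add(tuple(p))
    return sorted(perms)

def jigsaw_example(image, perms, rng):
    """Produce one training example: the shuffled tiles plus the index of
    the permutation used, which the CFN is trained to predict."""
    label = rng.randrange(len(perms))
    tiles = make_tiles(image)
    shuffled = [tiles[i] for i in perms[label]]
    return shuffled, label
```

Because each tile is fed to one branch of the Siamese-ennead network, the early layers never see context beyond a single tile; only the shared fully connected layers reason about the tiles' relative arrangement.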