Learning to localize objects with minimal supervision is an important problem in computer vision, since large, fully annotated datasets are extremely costly to obtain. In this paper, we propose a new method that achieves this goal using only image-level labels indicating whether the objects are present. Our approach combines a discriminative submodular cover problem, which automatically discovers a set of positive object windows, with a smoothed latent SVM formulation. The smoothing allows us to leverage efficient quasi-Newton optimization techniques. Our experiments demonstrate that the proposed approach provides approximately 70% relative improvement in average precision over the current state of the art on standard benchmark datasets.