Convolutional Neural Networks (CNNs) have demonstrated great performance in various vision tasks, such as image classification and object detection. However, some areas remain largely untouched, such as visual tracking. We believe the biggest bottleneck in applying CNNs to visual tracking is the lack of training data: the power of a CNN usually relies on huge amounts of training data (possibly millions of examples), whereas in visual tracking we have only one labeled sample, in the first frame. In this paper, we address this issue by transferring rich feature hierarchies from an offline pretrained CNN into online tracking. During online tracking, the CNN is also fine-tuned to adapt to the appearance of the tracked target specified in the first frame of the video. We evaluate the proposed tracker on an open benchmark and a non-rigid object tracking dataset. Our tracker demonstrates substantial improvements over other state-of-the-art trackers.
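The transfer-then-adapt idea can be sketched in miniature. The fragment below is an illustrative toy, not the paper's method: a frozen random projection stands in for the offline-pretrained CNN feature hierarchy, and a small logistic-regression head stands in for the fine-tuned layers, trained online from the single labeled target patch (plus background patches sampled around it) in the first frame. All names and dimensions here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for an offline-pretrained CNN: a FROZEN feature extractor.
# (In the paper's setting this would be a conv net pretrained on
# large-scale image data; a fixed random projection plays that role.)
W_feat = rng.standard_normal((512, 64)) / np.sqrt(512)

def extract_features(x):
    """Frozen 'pretrained' features; never updated during tracking."""
    return np.maximum(x @ W_feat, 0.0)  # ReLU

# Online-adapted head, trained from the single labeled first frame.
w_head = np.zeros(64)
b_head = 0.0

def finetune_head(x_pos, x_neg, lr=0.1, steps=200):
    """Fine-tune a logistic head: target patch(es) vs. background
    patches sampled from the first frame."""
    global w_head, b_head
    F_pos = extract_features(x_pos)
    F_neg = extract_features(x_neg)
    for _ in range(steps):
        p_pos = 1.0 / (1.0 + np.exp(-(F_pos @ w_head + b_head)))
        p_neg = 1.0 / (1.0 + np.exp(-(F_neg @ w_head + b_head)))
        # Class-balanced gradient: the lone positive sample must not
        # be drowned out by the many background negatives.
        gw = 0.5 * (F_pos.T @ (p_pos - 1.0) / len(F_pos)
                    + F_neg.T @ p_neg / len(F_neg))
        gb = 0.5 * ((p_pos - 1.0).mean() + p_neg.mean())
        w_head -= lr * gw
        b_head -= lr * gb

def score(x):
    """Confidence that a candidate patch contains the target."""
    f = extract_features(x)
    return 1.0 / (1.0 + np.exp(-(f @ w_head + b_head)))

# First frame: one labeled target patch, plus sampled background.
target = rng.standard_normal(512)
background = rng.standard_normal((20, 512))
finetune_head(target[None, :], background)

# A slightly perturbed view of the target should score high,
# while background patches score low.
print(score(target), score(background[0]))
```

The point of the sketch is the division of labor: the representation is transferred and kept fixed (it encodes generic visual knowledge the single frame cannot supply), while only a lightweight head adapts online to the specific target's appearance.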