A simple modification for single-stage generic object detection neural networks, such as YOLO and SSD, is proposed that improves detection accuracy on video data by exploiting the temporal behavior of the scene in the detection pipeline. The method is shown to considerably improve the detection accuracy of the base network, especially for occluded and hidden objects, and the modified network tends to detect hidden objects with higher confidence than the unmodified one. A weakly supervised training method is also proposed, which allows the modified network to be trained without any additional annotated data.