Robust Large-scale Video Synchronization without Annotations

Aligning video sequences is a fundamental yet still unsolved problem for a wide range of applications in computer graphics and vision. Most image processing methods, however, cannot be applied directly to the corresponding video problems because of the sheer volume of data involved: in our case, 1.75 TB of raw video. Using recent advances in deep learning, we present a scalable and robust method for detecting and computing optimal non-linear temporal video alignments. The algorithm learns to retrieve and match similar video frames from the input sequences without any human interaction or additional label annotations. We present an iterative scheme that leverages the structure of the videos themselves to remove the need for labels. Whereas previous methods are limited to short video sequences and assume similar conditions in vegetation, season, and illumination, our approach robustly aligns videos recorded months apart.
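To make the idea of a non-linear temporal alignment concrete, the sketch below aligns two sequences of per-frame embeddings with dynamic time warping over cosine distances. This is only an illustrative stand-in under assumed inputs, not the paper's method: the learned retrieval network, the iterative label-free training scheme, and the function name `dtw_align` are all hypothetical here.

```python
import numpy as np

def dtw_align(feats_a, feats_b):
    """Non-linear temporal alignment of two frame-embedding sequences
    (rows = frames) via dynamic time warping on cosine distance.
    Illustrative sketch only, not the paper's learned matcher."""
    n, m = len(feats_a), len(feats_b)
    # Pairwise cosine distances between L2-normalized embeddings.
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    dist = 1.0 - a @ b.T
    # Accumulated-cost matrix with an extra padding row/column.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    # Backtrack the optimal (monotonic, non-linear) warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]  # list of (frame_a, frame_b) correspondences
```

For example, if sequence B repeats the first frame of sequence A, the path dwells on that frame instead of forcing a one-to-one linear mapping, which is the behavior a non-linear alignment must exhibit.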