Untrimmed videos on social media or those captured by robots and surveillance cameras are of varied aspect ratios. However, 3D CNNs require a square-shaped video whose spatial dimension is smaller than the original one. Random or center-cropping techniques in use may leave out the video's subject altogether. To address this, we propose an unsupervised video cropping approach by shaping this as a retargeting and video-to-video synthesis problem. The synthesized video maintains 1:1 aspect ratio, smaller in size and is targeted at the video-subject throughout the whole duration. First, action localization on the individual frames is performed by identifying patches with homogeneous motion patterns and a single salient patch is pin-pointed. To avoid viewpoint jitters and flickering artifacts, any inter-frame scale or position changes among the patches is performed gradually over time. This issue is addressed with a poly-Bezier fitting in 3D space that passes through some chosen pivot timestamps and its shape is influenced by in-between control timestamps. To corroborate the effectiveness of the proposed method, we evaluate the video classification task by comparing our dynamic cropping with static random on three benchmark datasets: UCF-101, HMDB-51 and ActivityNet v1.3. The clip accuracy and top-1 accuracy for video classification after our cropping, outperform 3D CNN performances for same-sized inputs with random crop; sometimes even surpassing larger random crop sizes.
View on arXiv