
ShotFinder: Imagination-Driven Open-Domain Video Shot Retrieval via Web Search

Tao Yu
Haopeng Jin
Hao Wang
Shenghua Chai
Yujia Yang
Junhao Gong
Jiaming Guo
Minghui Zhang
Xinlong Chen
Zhenghao Zhang
Yuxuan Zhou
Yufei Xiong
Shanbin Zhang
Jiabing Yang
Hongzhu Yi
Xinming Wang
Cheng Zhong
Xiao Ma
Zhang Zhang
Yan Huang
Liang Wang
Main: 7 pages · 7 figures · 1 table · Bibliography: 3 pages · Appendix: 18 pages
Abstract

In recent years, large language models (LLMs) have made rapid progress in information retrieval, yet existing research has mainly focused on text or static multimodal settings. Open-domain video shot retrieval, which involves richer temporal structure and more complex semantics, still lacks systematic benchmarks and analysis. To fill this gap, we introduce ShotFinder, a benchmark that formalizes editing requirements as keyframe-oriented shot descriptions and defines five types of controllable single-factor constraints: temporal order, color, visual style, audio, and resolution. We curate 1,210 high-quality samples from YouTube across 20 thematic categories, generated with large models and verified by humans. Based on the benchmark, we propose ShotFinder, a text-driven three-stage retrieval and localization pipeline: (1) query expansion via video imagination, (2) candidate video retrieval with a search engine, and (3) description-guided temporal localization. Experiments on multiple closed-source and open-source models reveal a significant gap to human performance, with a clear imbalance across constraint types: temporal localization is relatively tractable, while color and visual style remain major challenges. These results show that open-domain video shot retrieval remains a critical capability that multimodal large models have yet to master.
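To make the three-stage pipeline concrete, the sketch below outlines the imagine-search-localize flow described in the abstract. All function names, data shapes, and the stubbed model/search calls are hypothetical placeholders; the actual ShotFinder implementation, prompts, and APIs are not specified here.

```python
# Hypothetical sketch of the three-stage pipeline described in the abstract.
# Names and stubbed calls are illustrative only, not the authors' implementation.
from dataclasses import dataclass


@dataclass
class Shot:
    video_id: str
    start_sec: float
    end_sec: float


def imagine_queries(shot_description: str, n: int = 3) -> list[str]:
    """Stage 1: query expansion via 'video imagination'.
    A real system would prompt an LLM to imagine plausible videos containing
    the described shot and emit web search queries for them."""
    return [f"{shot_description} (imagined variant {i})" for i in range(n)]


def search_candidate_videos(queries: list[str], k: int = 5) -> list[str]:
    """Stage 2: candidate video retrieval with a search engine.
    Here we return placeholder video IDs; a real system would call a search
    API and collect video identifiers (e.g., YouTube IDs)."""
    return [f"video_{i}" for i, _ in enumerate(queries[:k])]


def localize_shot(video_id: str, shot_description: str) -> Shot:
    """Stage 3: description-guided temporal localization.
    A real system would sample frames, score them against the keyframe-oriented
    description with a multimodal model, and return a time span."""
    return Shot(video_id=video_id, start_sec=0.0, end_sec=5.0)


def shot_finder(shot_description: str) -> list[Shot]:
    """End-to-end flow: imagine -> search -> localize."""
    queries = imagine_queries(shot_description)
    candidates = search_candidate_videos(queries)
    return [localize_shot(v, shot_description) for v in candidates]


if __name__ == "__main__":
    for shot in shot_finder("a red kite rising over a snowy beach at dusk"):
        print(shot)
```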
