REGen: Multimodal Retrieval-Embedded Generation for Long-to-Short Video Editing

Short videos are an effective tool for promoting content and improving knowledge accessibility. While existing extractive video summarization methods struggle to produce a coherent narrative, existing abstractive methods cannot "quote" from the input videos, i.e., insert short video clips into their outputs. In this work, we explore novel video editing models for generating shorts that feature a coherent narrative with embedded video insertions extracted from a long input video. We propose a retrieval-embedded generation framework that allows a large language model to quote multimodal resources while maintaining a coherent narrative. Our proposed REGen system first generates the output story script with quote placeholders using a finetuned large language model, and then uses a novel retrieval model to fill each placeholder by selecting, from a pool of candidate quotable video clips, the clip that best supports the narrative. We evaluate the proposed method on the task of documentary teaser generation, where short interview insertions are commonly used to support the narrative of a documentary. Our objective evaluations show that the proposed method can effectively insert short video clips while maintaining a coherent narrative. In a subjective survey, our proposed method outperforms existing abstractive and extractive approaches in terms of coherence, alignment, and realism in teaser generation.
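To make the two-stage pipeline concrete, the sketch below illustrates the retrieval step that fills quote placeholders in a generated script. Everything here is an illustrative assumption, not the authors' actual interface: the [QUOTE] token, the function names, and the toy bag-of-words embedding (which stands in for the paper's finetuned LLM and learned retrieval model).

```python
import re
import math
from collections import Counter

QUOTE_TOKEN = "[QUOTE]"  # assumed placeholder token; the paper's actual marker may differ


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding; a stand-in for a learned multimodal encoder."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def fill_quotes(script: str, candidates: list[dict]) -> str:
    """Replace each [QUOTE] placeholder with the candidate clip whose
    transcript best matches the narrative, scored by cosine similarity.
    For simplicity, only the text preceding the placeholder is used as
    context; the paper's retrieval model conditions on richer signals."""
    pool = list(candidates)
    segments = script.split(QUOTE_TOKEN)
    out = [segments[0]]
    for nxt in segments[1:]:
        context = embed(out[-1])  # narrative leading up to the placeholder
        best = max(pool, key=lambda c: cosine(context, embed(c["transcript"])))
        pool.remove(best)  # avoid quoting the same clip twice
        out.append(f'<clip {best["id"]}: "{best["transcript"]}">')
        out.append(nxt)
    return "".join(out)


# Stage 1 (assumed): a finetuned LLM emits a story script with placeholders.
script = (
    "The reef is changing faster than anyone predicted. " + QUOTE_TOKEN +
    " Scientists now race to map what remains. " + QUOTE_TOKEN
)
# Candidate quotable clips, each with an id and a transcript (hypothetical data).
clips = [
    {"id": "intv_03", "transcript": "we never predicted the reef would change this fast"},
    {"id": "intv_07", "transcript": "mapping what remains is a race against time"},
    {"id": "intv_11", "transcript": "funding for the expedition fell through last year"},
]
print(fill_quotes(script, clips))
```

Running this prints the script with each placeholder replaced by the best-matching interview clip, mirroring how REGen's retrieval stage grounds the abstractive narrative in verbatim footage.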
@article{xu2025_2505.18880,
  title={REGen: Multimodal Retrieval-Embedded Generation for Long-to-Short Video Editing},
  author={Weihan Xu and Yimeng Ma and Jingyue Huang and Yang Li and Wenye Ma and Taylor Berg-Kirkpatrick and Julian McAuley and Paul Pu Liang and Hao-Wen Dong},
  journal={arXiv preprint arXiv:2505.18880},
  year={2025}
}