CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment

14 September 2022

Papers citing "CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment"

8 / 58 papers shown

Title | Authors | Tags | Counts | Date
Cross-Modal and Hierarchical Modeling of Video and Text | Bowen Zhang, Hexiang Hu, Fei Sha | BDL, AI4TS | 58 191 0 | 16 Oct 2018
A Joint Sequence Fusion Model for Video Question Answering and Retrieval | Youngjae Yu, Jongseok Kim, Gunhee Kim | | 85 345 0 | 07 Aug 2018
Localizing Moments in Video with Natural Language | Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, Bryan C. Russell | | 115 946 0 | 04 Aug 2017
The "something something" video database for learning and evaluating visual common sense | Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, S. Westphal, ..., Moritz Mueller-Freitag, F. Hoppe, Christian Thurau, Ingo Bax, Roland Memisevic | VLM | 84 1,531 0 | 13 Jun 2017
Dense-Captioning Events in Videos | Ranjay Krishna, Kenji Hata, F. Ren, Li Fei-Fei, Juan Carlos Niebles | | 136 1,244 0 | 02 May 2017
Movie Description | Anna Rohrbach, Atousa Torabi, Marcus Rohrbach, Niket Tandon, C. Pal, Hugo Larochelle, Aaron Courville, Bernt Schiele | 3DV, VGen | 79 357 0 | 12 May 2016
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations | Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, ..., Yannis Kalantidis, Li Li, David A. Shamma, Michael S. Bernstein, Fei-Fei Li | | 215 5,743 0 | 23 Feb 2016
Microsoft COCO: Common Objects in Context | Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, C. L. Zitnick, Piotr Dollár | ObjD | 413 43,638 0 | 01 May 2014