Towards Long-Form Video Understanding

21 June 2021

Papers citing "Towards Long-Form Video Understanding"

50 / 121 papers shown

Title
Action Scene Graphs for Long-Form Understanding of Egocentric Videos Ivan Rodin Antonino Furnari Kyle Min Subarna Tripathi G. Farinella EgoV 27 12 0 06 Dec 2023
Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding in Novel Domains Rohan Myer Krishnan Zitian Tang Zhiqiu Yu Chen Sun 53 1 0 30 Nov 2023
Towards Weakly Supervised End-to-end Learning for Long-video Action Recognition Jiaming Zhou Hanjun Li Kun-Yu Lin Junwei Liang 23 1 0 28 Nov 2023
Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities A. Piergiovanni Isaac Noble Dahun Kim Michael S. Ryoo Victor Gomes A. Angelova 36 19 0 09 Nov 2023
LabelFormer: Object Trajectory Refinement for Offboard Perception from LiDAR Point Clouds Anqi Joyce Yang Sergio Casas Nikita Dvornik Sean Segal Yuwen Xiong Jordan Sir Kwang Hu Carter Fang R. Urtasun 37 6 0 02 Nov 2023
Object-centric Video Representation for Long-term Action Anticipation Ce Zhang Changcheng Fu Shijie Wang Nakul Agarwal Kwonjoon Lee Chiho Choi Chen Sun 17 14 0 31 Oct 2023
MM-VID: Advancing Video Understanding with GPT-4V(ision) Kevin Qinghong Lin Faisal Ahmed Linjie Li Chung-Ching Lin E. Azarnasab ... Lin Liang Zicheng Liu Yumao Lu Ce Liu Lijuan Wang MLLM 28 63 0 30 Oct 2023
TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding Shuhuai Ren Sishuo Chen Shicheng Li Xu Sun Lu Hou ViT 43 28 0 29 Oct 2023
Query-aware Long Video Localization and Relation Discrimination for Deep Video Understanding Yuanxing Xu Yuting Wei Bin Wu 25 0 0 19 Oct 2023
Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for Long-form Video Understanding Mohamed Afham Satya Narayan Shukla Omid Poursaeed Pengchuan Zhang Ashish Shah Sernam Lim VLM 26 2 0 20 Sep 2023
Specification-Driven Video Search via Foundation Models and Formal Verification Yunhao Yang Jean-Raphael Gaglione Sandeep P. Chinchali Ufuk Topcu 60 5 0 18 Sep 2023
Predicting Routine Object Usage for Proactive Robot Assistance Maithili Patel Aswin Prakash Sonia Chernova AI4TS 34 8 0 12 Sep 2023
Knowledge-Guided Short-Context Action Anticipation in Human-Centric Videos Sarthak Bhagat Simon Stepputtis Joseph Campbell Katia P. Sycara 33 4 0 12 Sep 2023
Frequency-Aware Self-Supervised Long-Tailed Learning Ci-Siang Lin Min-Hung Chen Y. Wang 20 3 0 09 Sep 2023
Large Content And Behavior Models To Understand, Simulate, And Optimize Content And Behavior Ashmit Khandelwal Aditya Agrawal Aanisha Bhattacharyya Yaman Kumar Singla Somesh Singh ... Ishita Dasgupta Stefano Petrangeli R. Shah Changyou Chen Balaji Krishnamurthy 18 8 0 01 Sep 2023
Masked Feature Modelling: Feature Masking for the Unsupervised Pre-training of a Graph Attention Network Block for Bottom-up Video Event Recognition Dimitrios Daskalakis Nikolaos Gkalelis Vasileios Mezaris 36 0 0 24 Aug 2023
Are current long-term video understanding datasets long-term? Ombretta Strafforello Klamer Schutte J. C. V. Gemert 19 8 0 22 Aug 2023
Long-range Multimodal Pretraining for Movie Understanding Dawit Mureja Argaw Joon-Young Lee Markus Woodson In So Kweon Fabian Caba Heilbron VLM 30 7 0 18 Aug 2023
EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding K. Mangalam Raiymbek Akshulakov Jitendra Malik 25 247 0 17 Aug 2023
Human-centered NLP Fact-checking: Co-Designing with Fact-checkers using Matchmaking for AI Houjiang Liu Anubrata Das Alexander Boltz Didi Zhou Daisy Pinaroc Matthew Lease Min Kyung Lee HAI 25 16 0 14 Aug 2023
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding Enxin Song Wenhao Chai Guanhong Wang Yucheng Zhang Haoyang Zhou ... Tianbo Ye Yanting Zhang Yang Lu Jenq-Neng Hwang Gaoang Wang VLM MLLM 22 262 0 31 Jul 2023
SUIT: Learning Significance-guided Information for 3D Temporal Detection Zheyuan Zhou Jiachen Lu Yi Zeng Hang Xu Li Zhang 3DPC 35 2 0 04 Jul 2023
How can objects help action recognition? Xingyi Zhou Anurag Arnab Chen Sun Cordelia Schmid 35 14 0 20 Jun 2023
A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot Aanisha Bhattacharya Yaman Kumar Singla Balaji Krishnamurthy R. Shah Changyou Chen VGen 26 11 0 16 May 2023
Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs Roei Herzig Alon Mendelson Leonid Karlinsky Assaf Arbelle Rogerio Feris Trevor Darrell Amir Globerson VLM 32 31 0 10 May 2023
How you feelin'? Learning Emotions and Mental States in Movie Scenes D. Srivastava A. Singh Makarand Tapaswi 32 10 0 12 Apr 2023
Looking Similar, Sounding Different: Leveraging Counterfactual Cross-Modal Pairs for Audiovisual Representation Learning Nikhil Singh Chih-Wei Wu Iroro Orife Mahdi M. Kalayeh 23 2 0 12 Apr 2023
On the Benefits of 3D Pose and Tracking for Human Action Recognition Jathushan Rajasegaran Georgios Pavlakos Angjoo Kanazawa Christoph Feichtenhofer Jitendra Malik 36 30 0 03 Apr 2023
Selective Structured State-Spaces for Long-Form Video Understanding Jue Wang Wenjie Zhu Pichao Wang Xiang Yu Linda Liu Mohamed Omar Raffay Hamid 41 94 0 25 Mar 2023
Gated-ViGAT: Efficient Bottom-Up Event Recognition and Explanation Using a New Frame Selection Policy and Gating Mechanism Nikolaos Gkalelis Dimitrios Daskalakis Vasileios Mezaris 18 4 0 18 Jan 2023
Building Scalable Video Understanding Benchmarks through Sports Aniket Agarwal Alex Zhang Karthik Narasimhan Igor Gilitschenski Vishvak Murahari Yash Kant 19 1 0 17 Jan 2023
HierVL: Learning Hierarchical Video-Language Embeddings Kumar Ashutosh Rohit Girdhar Lorenzo Torresani Kristen Grauman VLM AI4TS 22 51 0 05 Jan 2023
Efficient Movie Scene Detection using State-Space Transformers Md. Mohaiminul Islam Mahmudul Hasan Kishan Athrey Tony Braskich Gedas Bertasius ViT 36 44 0 29 Dec 2022
MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering Difei Gao Luowei Zhou Lei Ji Linchao Zhu Yezhou Yang Mike Zheng Shou 38 60 0 19 Dec 2022
Deep Architectures for Content Moderation and Movie Content Rating Fatih Çagatay Akyön A. Temi̇zel 33 4 0 08 Dec 2022
PromptonomyViT: Multi-Task Prompt Learning Improves Video Transformers using Synthetic Scene Data Roei Herzig Ofir Abramovich Elad Ben-Avraham Assaf Arbelle Leonid Karlinsky Ariel Shamir Trevor Darrell Amir Globerson 41 16 0 08 Dec 2022
Spatio-Temporal Crop Aggregation for Video Representation Learning Sepehr Sameni Simon Jenni Paolo Favaro 21 3 0 30 Nov 2022
Soft-Landing Strategy for Alleviating the Task Discrepancy Problem in Temporal Action Localization Tasks Hyolim Kang Hanjung Kim Joungbin An Minsu Cho Seon Joo Kim 32 5 0 11 Nov 2022
Zero-shot Video Moment Retrieval With Off-the-Shelf Models Anuj Diwan Puyuan Peng Raymond J. Mooney VLM 28 3 0 03 Nov 2022
End-to-end Transformer for Compressed Video Quality Enhancement Li Yu Wenshuai Chang Shiyu Wu Moncef Gabbouj ViT 24 8 0 25 Oct 2022
Holistic Interaction Transformer Network for Action Detection Gueter Josmy Faure Min-Hung Chen S. Lai 33 37 0 23 Oct 2022
MovieCLIP: Visual Scene Recognition in Movies Digbalay Bose Rajat Hebbar Krishna Somandepalli Haoyang Zhang Huayu Chen K. Cole-McLaughlin Haoran Wang Shrikanth Narayanan CLIP 12 20 0 20 Oct 2022
S4ND: Modeling Images and Videos as Multidimensional Signals Using State Spaces Eric N. D. Nguyen Karan Goel Albert Gu Gordon W. Downs Preey Shah Tri Dao S. Baccus Christopher Ré VLM 22 38 0 12 Oct 2022
Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning Yuchong Sun Hongwei Xue Ruihua Song Bei Liu Huan Yang Jianlong Fu AI4TS VLM 18 68 0 12 Oct 2022
CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding Zhijian Hou Wanjun Zhong Lei Ji Difei Gao Kun Yan W. Chan Chong-Wah Ngo Zheng Shou Nan Duan AI4TS 34 24 0 22 Sep 2022
Self-Contained Entity Discovery from Captioned Videos M. Ayoughi P. Mettes Paul T. Groth 28 2 0 13 Aug 2022
Vision-Centric BEV Perception: A Survey Yuexin Ma Tai Wang Xuyang Bai Huitong Yang Yuenan Hou Yaming Wang Yu Qiao Ruigang Yang Tianyi Zhou Xinge Zhu 45 129 0 04 Aug 2022
Two-Stream Transformer Architecture for Long Video Understanding Edward Fish Jon Weinbren Andrew Gilbert ViT 27 6 0 02 Aug 2022
Video Question Answering with Iterative Video-Text Co-Tokenization A. Piergiovanni K. Morton Weicheng Kuo Michael S. Ryoo A. Angelova 22 18 0 01 Aug 2022
EgoEnv: Human-centric environment representations from egocentric video Tushar Nagarajan Santhosh Kumar Ramakrishnan Ruta Desai James M. Hillis Kristen Grauman EgoV 33 19 0 22 Jul 2022