2106.11310
Cited By
Towards Long-Form Video Understanding
21 June 2021
Chao-Yuan Wu
Philipp Krähenbühl
VLM
ViT
Papers citing
"Towards Long-Form Video Understanding"
50 / 121 papers shown
Title
Action Scene Graphs for Long-Form Understanding of Egocentric Videos
Ivan Rodin
Antonino Furnari
Kyle Min
Subarna Tripathi
G. Farinella
EgoV
27
12
0
06 Dec 2023
Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding in Novel Domains
Rohan Myer Krishnan
Zitian Tang
Zhiqiu Yu
Chen Sun
53
1
0
30 Nov 2023
Towards Weakly Supervised End-to-end Learning for Long-video Action Recognition
Jiaming Zhou
Hanjun Li
Kun-Yu Lin
Junwei Liang
23
1
0
28 Nov 2023
Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities
A. Piergiovanni
Isaac Noble
Dahun Kim
Michael S. Ryoo
Victor Gomes
A. Angelova
36
19
0
09 Nov 2023
LabelFormer: Object Trajectory Refinement for Offboard Perception from LiDAR Point Clouds
Anqi Joyce Yang
Sergio Casas
Nikita Dvornik
Sean Segal
Yuwen Xiong
Jordan Sir Kwang Hu
Carter Fang
R. Urtasun
37
6
0
02 Nov 2023
Object-centric Video Representation for Long-term Action Anticipation
Ce Zhang
Changcheng Fu
Shijie Wang
Nakul Agarwal
Kwonjoon Lee
Chiho Choi
Chen Sun
17
14
0
31 Oct 2023
MM-VID: Advancing Video Understanding with GPT-4V(ision)
Kevin Lin
Faisal Ahmed
Linjie Li
Chung-Ching Lin
E. Azarnasab
...
Lin Liang
Zicheng Liu
Yumao Lu
Ce Liu
Lijuan Wang
MLLM
28
63
0
30 Oct 2023
TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding
Shuhuai Ren
Sishuo Chen
Shicheng Li
Xu Sun
Lu Hou
ViT
43
28
0
29 Oct 2023
Query-aware Long Video Localization and Relation Discrimination for Deep Video Understanding
Yuanxing Xu
Yuting Wei
Bin Wu
25
0
0
19 Oct 2023
Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for Long-form Video Understanding
Mohamed Afham
Satya Narayan Shukla
Omid Poursaeed
Pengchuan Zhang
Ashish Shah
Ser-Nam Lim
VLM
26
2
0
20 Sep 2023
Specification-Driven Video Search via Foundation Models and Formal Verification
Yunhao Yang
Jean-Raphael Gaglione
Sandeep P. Chinchali
Ufuk Topcu
60
5
0
18 Sep 2023
Predicting Routine Object Usage for Proactive Robot Assistance
Maithili Patel
Aswin Prakash
Sonia Chernova
AI4TS
34
8
0
12 Sep 2023
Knowledge-Guided Short-Context Action Anticipation in Human-Centric Videos
Sarthak Bhagat
Simon Stepputtis
Joseph Campbell
Katia P. Sycara
33
4
0
12 Sep 2023
Frequency-Aware Self-Supervised Long-Tailed Learning
Ci-Siang Lin
Min-Hung Chen
Y. Wang
20
3
0
09 Sep 2023
Large Content And Behavior Models To Understand, Simulate, And Optimize Content And Behavior
Ashmit Khandelwal
Aditya Agrawal
Aanisha Bhattacharyya
Yaman Kumar Singla
Somesh Singh
...
Ishita Dasgupta
Stefano Petrangeli
R. Shah
Changyou Chen
Balaji Krishnamurthy
18
8
0
01 Sep 2023
Masked Feature Modelling: Feature Masking for the Unsupervised Pre-training of a Graph Attention Network Block for Bottom-up Video Event Recognition
Dimitrios Daskalakis
Nikolaos Gkalelis
Vasileios Mezaris
36
0
0
24 Aug 2023
Are current long-term video understanding datasets long-term?
Ombretta Strafforello
Klamer Schutte
J. C. V. Gemert
19
8
0
22 Aug 2023
Long-range Multimodal Pretraining for Movie Understanding
Dawit Mureja Argaw
Joon-Young Lee
Markus Woodson
In So Kweon
Fabian Caba Heilbron
VLM
30
7
0
18 Aug 2023
EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding
K. Mangalam
Raiymbek Akshulakov
Jitendra Malik
25
247
0
17 Aug 2023
Human-centered NLP Fact-checking: Co-Designing with Fact-checkers using Matchmaking for AI
Houjiang Liu
Anubrata Das
Alexander Boltz
Didi Zhou
Daisy Pinaroc
Matthew Lease
Min Kyung Lee
HAI
25
16
0
14 Aug 2023
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
Enxin Song
Wenhao Chai
Guanhong Wang
Yucheng Zhang
Haoyang Zhou
...
Tianbo Ye
Yanting Zhang
Yang Lu
Jenq-Neng Hwang
Gaoang Wang
VLM
MLLM
22
262
0
31 Jul 2023
SUIT: Learning Significance-guided Information for 3D Temporal Detection
Zheyuan Zhou
Jiachen Lu
Yi Zeng
Hang Xu
Li Zhang
3DPC
35
2
0
04 Jul 2023
How can objects help action recognition?
Xingyi Zhou
Anurag Arnab
Chen Sun
Cordelia Schmid
35
14
0
20 Jun 2023
A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot
Aanisha Bhattacharyya
Yaman Kumar Singla
Balaji Krishnamurthy
R. Shah
Changyou Chen
VGen
26
11
0
16 May 2023
Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs
Roei Herzig
Alon Mendelson
Leonid Karlinsky
Assaf Arbelle
Rogerio Feris
Trevor Darrell
Amir Globerson
VLM
32
31
0
10 May 2023
How you feelin'? Learning Emotions and Mental States in Movie Scenes
D. Srivastava
A. Singh
Makarand Tapaswi
32
10
0
12 Apr 2023
Looking Similar, Sounding Different: Leveraging Counterfactual Cross-Modal Pairs for Audiovisual Representation Learning
Nikhil Singh
Chih-Wei Wu
Iroro Orife
Mahdi M. Kalayeh
23
2
0
12 Apr 2023
On the Benefits of 3D Pose and Tracking for Human Action Recognition
Jathushan Rajasegaran
Georgios Pavlakos
Angjoo Kanazawa
Christoph Feichtenhofer
Jitendra Malik
36
30
0
03 Apr 2023
Selective Structured State-Spaces for Long-Form Video Understanding
Jue Wang
Wenjie Zhu
Pichao Wang
Xiang Yu
Linda Liu
Mohamed Omar
Raffay Hamid
41
94
0
25 Mar 2023
Gated-ViGAT: Efficient Bottom-Up Event Recognition and Explanation Using a New Frame Selection Policy and Gating Mechanism
Nikolaos Gkalelis
Dimitrios Daskalakis
Vasileios Mezaris
18
4
0
18 Jan 2023
Building Scalable Video Understanding Benchmarks through Sports
Aniket Agarwal
Alex Zhang
Karthik Narasimhan
Igor Gilitschenski
Vishvak Murahari
Yash Kant
19
1
0
17 Jan 2023
HierVL: Learning Hierarchical Video-Language Embeddings
Kumar Ashutosh
Rohit Girdhar
Lorenzo Torresani
Kristen Grauman
VLM
AI4TS
22
51
0
05 Jan 2023
Efficient Movie Scene Detection using State-Space Transformers
Md. Mohaiminul Islam
Mahmudul Hasan
Kishan Athrey
Tony Braskich
Gedas Bertasius
ViT
36
44
0
29 Dec 2022
MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering
Difei Gao
Luowei Zhou
Lei Ji
Linchao Zhu
Yezhou Yang
Mike Zheng Shou
38
60
0
19 Dec 2022
Deep Architectures for Content Moderation and Movie Content Rating
Fatih Çagatay Akyön
A. Temizel
33
4
0
08 Dec 2022
PromptonomyViT: Multi-Task Prompt Learning Improves Video Transformers using Synthetic Scene Data
Roei Herzig
Ofir Abramovich
Elad Ben-Avraham
Assaf Arbelle
Leonid Karlinsky
Ariel Shamir
Trevor Darrell
Amir Globerson
41
16
0
08 Dec 2022
Spatio-Temporal Crop Aggregation for Video Representation Learning
Sepehr Sameni
Simon Jenni
Paolo Favaro
21
3
0
30 Nov 2022
Soft-Landing Strategy for Alleviating the Task Discrepancy Problem in Temporal Action Localization Tasks
Hyolim Kang
Hanjung Kim
Joungbin An
Minsu Cho
Seon Joo Kim
32
5
0
11 Nov 2022
Zero-shot Video Moment Retrieval With Off-the-Shelf Models
Anuj Diwan
Puyuan Peng
Raymond J. Mooney
VLM
28
3
0
03 Nov 2022
End-to-end Transformer for Compressed Video Quality Enhancement
Li Yu
Wenshuai Chang
Shiyu Wu
Moncef Gabbouj
ViT
24
8
0
25 Oct 2022
Holistic Interaction Transformer Network for Action Detection
Gueter Josmy Faure
Min-Hung Chen
S. Lai
33
37
0
23 Oct 2022
MovieCLIP: Visual Scene Recognition in Movies
Digbalay Bose
Rajat Hebbar
Krishna Somandepalli
Haoyang Zhang
Huayu Chen
K. Cole-McLaughlin
Haoran Wang
Shrikanth Narayanan
CLIP
12
20
0
20 Oct 2022
S4ND: Modeling Images and Videos as Multidimensional Signals Using State Spaces
Eric N. D. Nguyen
Karan Goel
Albert Gu
Gordon W. Downs
Preey Shah
Tri Dao
S. Baccus
Christopher Ré
VLM
22
38
0
12 Oct 2022
Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning
Yuchong Sun
Hongwei Xue
Ruihua Song
Bei Liu
Huan Yang
Jianlong Fu
AI4TS
VLM
18
68
0
12 Oct 2022
CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding
Zhijian Hou
Wanjun Zhong
Lei Ji
Difei Gao
Kun Yan
W. Chan
Chong-Wah Ngo
Zheng Shou
Nan Duan
AI4TS
34
24
0
22 Sep 2022
Self-Contained Entity Discovery from Captioned Videos
M. Ayoughi
P. Mettes
Paul T. Groth
28
2
0
13 Aug 2022
Vision-Centric BEV Perception: A Survey
Yuexin Ma
Tai Wang
Xuyang Bai
Huitong Yang
Yuenan Hou
Yaming Wang
Yu Qiao
Ruigang Yang
Tianyi Zhou
Xinge Zhu
45
129
0
04 Aug 2022
Two-Stream Transformer Architecture for Long Video Understanding
Edward Fish
Jon Weinbren
Andrew Gilbert
ViT
27
6
0
02 Aug 2022
Video Question Answering with Iterative Video-Text Co-Tokenization
A. Piergiovanni
K. Morton
Weicheng Kuo
Michael S. Ryoo
A. Angelova
22
18
0
01 Aug 2022
EgoEnv: Human-centric environment representations from egocentric video
Tushar Nagarajan
Santhosh Kumar Ramakrishnan
Ruta Desai
James M. Hillis
Kristen Grauman
EgoV
33
19
0
22 Jul 2022