Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2405.09711
Cited By
STAR: A Benchmark for Situated Reasoning in Real-World Videos
15 May 2024
Bo Wu
Shoubin Yu
Zhenfang Chen
Joshua B. Tenenbaum
Chuang Gan
Re-assign community
ArXiv
PDF
HTML
Papers citing
"STAR: A Benchmark for Situated Reasoning in Real-World Videos"
50 / 56 papers shown
Title
SeriesBench: A Benchmark for Narrative-Driven Drama Series Understanding
Yiming Lei
Chenkai Zhang
Ziqiang Liu
Haitao Leng
Shaoguo Liu
Tingting Gao
Qingjie Liu
Yunhong Wang
AI4TS
77
0
0
30 Apr 2025
Perception Encoder: The best visual embeddings are not at the output of the network
Daniel Bolya
Po-Yao (Bernie) Huang
Peize Sun
Jang Hyun Cho
Andrea Madotto
...
Shiyu Dong
Nikhila Ravi
Daniel Li
Piotr Dollár
Christoph Feichtenhofer
ObjD
VOS
161
5
0
17 Apr 2025
Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation
Chuanqi Cheng
Jian Guan
Wei Wu
Rui Yan
VLM
104
0
0
03 Apr 2025
M2-omni: Advancing Omni-MLLM for Comprehensive Modality Support with Competitive Performance
Qingpei Guo
Kaiyou Song
Zipeng Feng
Ziping Ma
Qinglong Zhang
...
Yunxiao Sun
Tai-WeiChang
Jingdong Chen
Ming Yang
Jun Zhou
MLLM
VLM
127
3
0
26 Feb 2025
A Video-grounded Dialogue Dataset and Metric for Event-driven Activities
Wiradee Imrattanatrai
Masaki Asada
Kimihiro Hasegawa
Zhi-Qi Cheng
Ken Fukuda
Teruko Mitamura
VGen
100
0
0
30 Jan 2025
VidCtx: Context-aware Video Question Answering with Image Models
Andreas Goulas
Vasileios Mezaris
Ioannis Patras
368
0
0
23 Dec 2024
HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding
Chenxin Tao
Shiqian Su
X. Zhu
Chenyu Zhang
Zhe Chen
...
Wenhai Wang
Lewei Lu
Gao Huang
Yu Qiao
Jifeng Dai
MLLM
VLM
156
2
0
20 Dec 2024
Do Language Models Understand Time?
Xi Ding
Lei Wang
220
1
0
18 Dec 2024
Neptune: The Long Orbit to Benchmarking Long Video Understanding
Arsha Nagrani
Ruotong Wang
Ramin Mehran
Rachel Hornung
N. B. Gundavarapu
...
Boqing Gong
Cordelia Schmid
Mikhail Sirotenko
Yukun Zhu
Tobias Weyand
131
4
0
12 Dec 2024
TimeRefine: Temporal Grounding with Time Refining Video LLM
Xizi Wang
Feng Cheng
Ziyang Wang
Huiyu Wang
Md. Mohaiminul Islam
Lorenzo Torresani
Joey Tianyi Zhou
Gedas Bertasius
David J. Crandall
150
2
0
12 Dec 2024
VideoSAVi: Self-Aligned Video Language Models without Human Supervision
Yogesh Kulkarni
Pooyan Fazli
VLM
144
2
0
01 Dec 2024
Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding
Yiming Zhang
Zhuokai Zhao
Zhaorun Chen
Zenghui Ding
Xianjun Yang
Yining Sun
406
1
0
21 Nov 2024
Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level
Andong Deng
Tongjia Chen
Shoubin Yu
Taojiannan Yang
Lincoln Spencer
Yapeng Tian
Ajmal Mian
Joey Tianyi Zhou
Chen Chen
LRM
83
2
0
15 Nov 2024
Situational Scene Graph for Structured Human-centric Situation Understanding
Chinthani Sugandhika
Chen Li
Deepu Rajan
Basura Fernando
374
1
0
30 Oct 2024
Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training
Gen Luo
Xue Yang
Wenhan Dou
Zhaokai Wang
Jifeng Dai
Jifeng Dai
Yu Qiao
Xizhou Zhu
VLM
MLLM
84
26
0
10 Oct 2024
Frame-Voyager: Learning to Query Frames for Video Large Language Models
Sicheng Yu
Chengkai Jin
Huanyu Wang
Zhenghao Chen
Sheng Jin
...
Zhenbang Sun
Bingni Zhang
Jiawei Wu
Hao Zhang
Qianru Sun
88
5
0
04 Oct 2024
MARS: Benchmarking the Metaphysical Reasoning Abilities of Language Models with a Multi-task Evaluation Dataset
Weiqi Wang
Yangqiu Song
LRM
59
10
0
04 Jun 2024
Compositional 4D Dynamic Scenes Understanding with Physics Priors for Video Question Answering
Xingrui Wang
Wufei Ma
Angtian Wang
Shuo Chen
Adam Kortylewski
Alan Yuille
63
4
0
02 Jun 2024
VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos
Ziyang Wang
Shoubin Yu
Elias Stengel-Eskin
Jaehong Yoon
Feng Cheng
Gedas Bertasius
Mohit Bansal
91
60
0
29 May 2024
MoReVQA: Exploring Modular Reasoning Models for Video Question Answering
Juhong Min
Shyamal Buch
Arsha Nagrani
Minsu Cho
Cordelia Schmid
LRM
57
24
0
09 Apr 2024
VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding
Ahmad A Mahmood
Ashmal Vayani
Muzammal Naseer
Salman Khan
Fahad Shahbaz Khan
LRM
73
6
0
21 Mar 2024
PTR: A Benchmark for Part-based Conceptual, Relational, and Physical Reasoning
Yining Hong
Li Yi
J. Tenenbaum
Antonio Torralba
Chuang Gan
29
39
0
09 Dec 2021
Dynamic Visual Reasoning by Learning Differentiable Physics Models from Video and Language
Mingyu Ding
Zhenfang Chen
Tao Du
Ping Luo
J. Tenenbaum
Chuang Gan
VGen
PINN
OCL
59
74
0
28 Oct 2021
Grounding Physical Concepts of Objects and Events Through Dynamic Visual Reasoning
Zhenfang Chen
Jiayuan Mao
Jiajun Wu
Kwan-Yee K. Wong
J. Tenenbaum
Chuang Gan
VGen
51
93
0
30 Mar 2021
AGQA: A Benchmark for Compositional Spatio-Temporal Reasoning
Madeleine Grunde-McLaughlin
Ranjay Krishna
Maneesh Agrawala
CoGe
39
113
0
30 Mar 2021
Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling
Jie Lei
Linjie Li
Luowei Zhou
Zhe Gan
Tamara L. Berg
Joey Tianyi Zhou
Jingjing Liu
CLIP
93
651
0
11 Feb 2021
GECToR -- Grammatical Error Correction: Tag, Not Rewrite
Kostiantyn Omelianchuk
Vitaliy Atrasevych
Artem Chernodub
Oleksandr Skurzhanskyi
51
310
0
26 May 2020
Unbiased Scene Graph Generation from Biased Training
Kaihua Tang
Yulei Niu
Jianqiang Huang
Jiaxin Shi
Hanwang Zhang
CML
58
689
0
27 Feb 2020
Hierarchical Conditional Relation Networks for Video Question Answering
T. Le
Vuong Le
Svetha Venkatesh
T. Tran
39
258
0
25 Feb 2020
Action Genome: Actions as Composition of Spatio-temporal Scene Graphs
Jingwei Ji
Ranjay Krishna
Li Fei-Fei
Juan Carlos Niebles
57
339
0
15 Dec 2019
CATER: A diagnostic dataset for Compositional Actions and TEmporal Reasoning
Rohit Girdhar
Deva Ramanan
39
177
0
10 Oct 2019
CLEVRER: CoLlision Events for Video REpresentation and Reasoning
Kexin Yi
Yuta Saito
Yunzhu Li
Pushmeet Kohli
Jiajun Wu
Antonio Torralba
J. Tenenbaum
NAI
76
461
0
03 Oct 2019
VisualBERT: A Simple and Performant Baseline for Vision and Language
Liunian Harold Li
Mark Yatskar
Da Yin
Cho-Jui Hsieh
Kai-Wei Chang
VLM
102
1,939
0
09 Aug 2019
Language-Conditioned Graph Networks for Relational Reasoning
Ronghang Hu
Anna Rohrbach
Trevor Darrell
Kate Saenko
62
172
0
10 May 2019
TVQA+: Spatio-Temporal Grounding for Video Question Answering
Jie Lei
Licheng Yu
Tamara L. Berg
Joey Tianyi Zhou
52
228
0
25 Apr 2019
From Recognition to Cognition: Visual Commonsense Reasoning
Rowan Zellers
Yonatan Bisk
Ali Farhadi
Yejin Choi
LRM
BDL
OCL
ReLM
129
873
0
27 Nov 2018
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin
Ming-Wei Chang
Kenton Lee
Kristina Toutanova
VLM
SSL
SSeg
870
93,936
0
11 Oct 2018
Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding
Kexin Yi
Jiajun Wu
Chuang Gan
Antonio Torralba
Pushmeet Kohli
J. Tenenbaum
NAI
65
606
0
04 Oct 2018
TVQA: Localized, Compositional Video Question Answering
Muhammad Abdul Wahab
Licheng Yu
Mounir Nasr Allah
Tamara L. Berg
53
619
0
05 Sep 2018
A Dataset and Architecture for Visual Reasoning with a Working Memory
G. R. Yang
Igor Ganichev
Xiao-Jing Wang
Jonathon Shlens
David Sussillo
35
54
0
16 Mar 2018
VQS: Linking Segmentations to Questions and Answers for Supervised Attention in VQA and Question-Focused Semantic Segmentation
Chuang Gan
Yandong Li
Haoxiang Li
Chen Sun
Boqing Gong
36
127
0
15 Aug 2017
What Actions are Needed for Understanding Human Actions in Videos?
Gunnar Sigurdsson
Olga Russakovsky
Abhinav Gupta
39
133
0
09 Aug 2017
Inferring and Executing Programs for Visual Reasoning
Justin Johnson
B. Hariharan
Laurens van der Maaten
Judy Hoffman
Li Fei-Fei
C. L. Zitnick
Ross B. Girshick
NAI
59
542
0
10 May 2017
TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering
Y. Jang
Yale Song
Youngjae Yu
Youngjin Kim
Gunhee Kim
45
548
0
14 Apr 2017
CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning
Justin Johnson
B. Hariharan
Laurens van der Maaten
Li Fei-Fei
C. L. Zitnick
Ross B. Girshick
CoGe
271
2,346
0
20 Dec 2016
MarioQA: Answering Questions by Watching Gameplay Videos
Jonghwan Mun
Paul Hongsuck Seo
Ilchae Jung
Bohyung Han
61
108
0
06 Dec 2016
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
Yash Goyal
Tejas Khot
D. Summers-Stay
Dhruv Batra
Devi Parikh
CoGe
285
3,187
0
02 Dec 2016
RMPE: Regional Multi-person Pose Estimation
Haoshu Fang
Shuqin Xie
Yu-Wing Tai
Cewu Lu
3DH
99
1,583
0
01 Dec 2016
Aggregated Residual Transformations for Deep Neural Networks
Saining Xie
Ross B. Girshick
Piotr Dollár
Zhuowen Tu
Kaiming He
415
10,281
0
16 Nov 2016
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
Ranjay Krishna
Yuke Zhu
Oliver Groth
Justin Johnson
Kenji Hata
...
Yannis Kalantidis
Li Li
David A. Shamma
Michael S. Bernstein
Fei-Fei Li
161
5,706
0
23 Feb 2016
1
2
Next