Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2305.13786
Cited By
Perception Test: A Diagnostic Benchmark for Multimodal Video Models
23 May 2023
Viorica Puatruaucean
Lucas Smaira
Ankush Gupta
Adrià Recasens Continente
L. Markeeva
Dylan Banarse
Skanda Koppula
Joseph Heyward
Mateusz Malinowski
Yezhou Yang
Carl Doersch
Tatiana Matejovicova
Yury Sulsky
Antoine Miech
A. Fréchette
H. Klimczak
Raphael Köster
Junlin Zhang
Stephanie Winkler
Y. Aytar
Simon Osindero
Dima Damen
Andrew Zisserman
João Carreira
VLM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Perception Test: A Diagnostic Benchmark for Multimodal Video Models"
38 / 38 papers shown
Title
StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant
Haibo Wang
Bo Feng
Zhengfeng Lai
Mingze Xu
Shiyu Li
Weifeng Ge
Afshin Dehghan
Meng Cao
Ping-Chia Huang
OffRL
51
0
0
08 May 2025
MINERVA: Evaluating Complex Video Reasoning
Arsha Nagrani
Sachit Menon
Ahmet Iscen
Shyamal Buch
Ramin Mehran
...
Yukun Zhu
Carl Vondrick
Mikhail Sirotenko
Cordelia Schmid
Tobias Weyand
58
0
0
01 May 2025
Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks
Yixin Cao
Shibo Hong
X. Li
Jiahao Ying
Yubo Ma
...
Juanzi Li
Aixin Sun
Xuanjing Huang
Tat-Seng Chua
Yu Jiang
ALM
ELM
84
1
0
26 Apr 2025
ActionArt: Advancing Multimodal Large Models for Fine-Grained Human-Centric Video Understanding
Yi-Xing Peng
Q. Yang
Yu-Ming Tang
Shenghao Fu
Kun-Yu Lin
Xihan Wei
Wei-Shi Zheng
42
0
0
25 Apr 2025
VEU-Bench: Towards Comprehensive Understanding of Video Editing
Bozheng Li
Y. Wu
Yi Lu
Jiashuo Yu
Licheng Tang
Jiawang Cao
Wenqing Zhu
Yuyang Sun
Jay Wu
Wenbo Zhu
39
0
0
24 Apr 2025
ZipR1: Reinforcing Token Sparsity in MLLMs
Feng Chen
Yefei He
Lequan Lin
J. Liu
Bohan Zhuang
Qi Wu
46
0
0
23 Apr 2025
Perception Encoder: The best visual embeddings are not at the output of the network
Daniel Bolya
Po-Yao (Bernie) Huang
Peize Sun
Jang Hyun Cho
Andrea Madotto
...
Shiyu Dong
Nikhila Ravi
Daniel Li
Piotr Dollár
Christoph Feichtenhofer
ObjD
VOS
103
0
0
17 Apr 2025
How Can Objects Help Video-Language Understanding?
Zitian Tang
Shijie Wang
Junho Cho
Jaewook Yoo
Chen Sun
42
0
0
10 Apr 2025
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning
Xinhao Li
Ziang Yan
Desen Meng
Lu Dong
Xiangyu Zeng
Yinan He
Y. Wang
Yu Qiao
Yi Wang
Limin Wang
VLM
AI4TS
LRM
40
3
0
09 Apr 2025
TAPNext: Tracking Any Point (TAP) as Next Token Prediction
Artem Zholus
Carl Doersch
Yi Yang
Skanda Koppula
Viorica Patraucean
Xu He
Ignacio Rocco
Mehdi S. M. Sajjadi
Sarath Chandar
Ross Goroshin
30
0
0
08 Apr 2025
VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models
Dahun Kim
A. Piergiovanni
Ganesh Mallya
A. Angelova
CoGe
41
0
0
04 Apr 2025
Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation
Chuanqi Cheng
Jian-Yu Guan
Wei Yu Wu
Rui Yan
VLM
47
0
0
03 Apr 2025
VITED: Video Temporal Evidence Distillation
Yujie Lu
Yale Song
William Yang Wang
Lorenzo Torresani
Tushar Nagarajan
153
0
0
17 Mar 2025
Keyframe-oriented Vision Token Pruning: Enhancing Efficiency of Large Vision Language Models on Long-Form Video Processing
Yudong Liu
Jingwei Sun
Yueqian Lin
Jingyang Zhang
Ming Yin
Qinsi Wang
J. Zhang
H. Li
Y. Chen
VLM
76
2
0
13 Mar 2025
BIMBA: Selective-Scan Compression for Long-Range Video Question Answering
Md. Mohaiminul Islam
Tushar Nagarajan
Huiyu Wang
Gedas Bertasius
Lorenzo Torresani
159
0
0
12 Mar 2025
Baichuan-Omni-1.5 Technical Report
Yadong Li
J. Liu
Tao Zhang
Tao Zhang
S. Chen
...
Jianhua Xu
Haoze Sun
Mingan Lin
Zenan Zhou
Weipeng Chen
AuLLM
72
10
0
28 Jan 2025
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
Yuqian Yuan
Hang Zhang
Wentong Li
Zesen Cheng
Boqiang Zhang
...
Deli Zhao
Wenqiao Zhang
Yueting Zhuang
Jianke Zhu
Lidong Bing
72
5
0
31 Dec 2024
A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames
Pinelopi Papalampidi
Skanda Koppula
Shreya Pathak
Justin T Chiu
Joseph Heyward
Viorica Patraucean
Jiajun Shen
Antoine Miech
Andrew Zisserman
Aida Nematzdeh
VLM
60
24
0
31 Dec 2024
Neptune: The Long Orbit to Benchmarking Long Video Understanding
Arsha Nagrani
Mingda Zhang
Ramin Mehran
Rachel Hornung
N. B. Gundavarapu
...
Boqing Gong
Cordelia Schmid
Mikhail Sirotenko
Yukun Zhu
Tobias Weyand
103
4
0
12 Dec 2024
EgoPlan-Bench2: A Benchmark for Multimodal Large Language Model Planning in Real-World Scenarios
Lu Qiu
Yuying Ge
Yi Chen
Yixiao Ge
Ying Shan
Xihui Liu
LLMAG
LRM
98
5
0
05 Dec 2024
VideoSAVi: Self-Aligned Video Language Models without Human Supervision
Yogesh Kulkarni
Pooyan Fazli
VLM
103
2
0
01 Dec 2024
VideoOrion: Tokenizing Object Dynamics in Videos
Yicheng Feng
Yijiang Li
Wanpeng Zhang
Sipeng Zheng
Zongqing Lu
Sipeng Zheng
Zongqing Lu
109
1
0
25 Nov 2024
Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level
Andong Deng
Tongjia Chen
Shoubin Yu
Taojiannan Yang
Lincoln Spencer
Yapeng Tian
Ajmal Saeed Mian
Mohit Bansal
Chen Chen
LRM
65
1
0
15 Nov 2024
VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks
Lawrence Jang
Yinheng Li
Charles Ding
Justin Lin
Paul Pu Liang
Dan Zhao
Rogerio Bonatti
K. Koishida
46
5
0
24 Oct 2024
Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution
Zuyan Liu
Yuhao Dong
Ziwei Liu
Winston Hu
Jiwen Lu
Yongming Rao
ObjD
83
54
0
19 Sep 2024
Question-Answering Dense Video Events
Hangyu Qin
Junbin Xiao
Angela Yao
VLM
71
1
0
06 Sep 2024
MuMA-ToM: Multi-modal Multi-Agent Theory of Mind
Haojun Shi
Suyu Ye
Xinyu Fang
Chuanyang Jin
Leyla Isik
Yen-Ling Kuo
Tianmin Shu
LLMAG
69
7
0
22 Aug 2024
TAPVid-3D: A Benchmark for Tracking Any Point in 3D
Skanda Koppula
Ignacio Rocco
Yi Yang
Joe Heyward
João Carreira
Andrew Zisserman
Gabriel J. Brostow
Carl Doersch
52
14
0
08 Jul 2024
VANE-Bench: Video Anomaly Evaluation Benchmark for Conversational LMMs
Rohit K Bharadwaj
Hanan Gani
Muzammal Naseer
F. Khan
Salman Khan
59
3
0
14 Jun 2024
LVBench: An Extreme Long Video Understanding Benchmark
Weihan Wang
Zehai He
Wenyi Hong
Yean Cheng
Xiaohan Zhang
...
Shiyu Huang
Bin Xu
Yuxiao Dong
Ming Ding
Jie Tang
ELM
VLM
47
65
0
12 Jun 2024
CinePile: A Long Video Question Answering Dataset and Benchmark
Ruchit Rawal
Khalid Saifullah
Ronen Basri
David Jacobs
Gowthami Somepalli
Tom Goldstein
43
39
0
14 May 2024
CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion
Shoubin Yu
Jaehong Yoon
Mohit Bansal
77
4
0
08 Feb 2024
Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding in Novel Domains
Rohan Myer Krishnan
Zitian Tang
Zhiqiu Yu
Chen Sun
53
1
0
30 Nov 2023
Dissecting Multimodality in VideoQA Transformer Models by Impairing Modality Fusion
Isha Rawal
Alexander Matyasko
Shantanu Jaiswal
Basura Fernando
Cheston Tan
21
1
0
15 Jun 2023
Epic-Sounds: A Large-scale Dataset of Actions That Sound
Jaesung Huh
Jacob Chalk
Evangelos Kazakos
Dima Damen
Andrew Zisserman
EgoV
18
41
0
01 Feb 2023
Robustness Analysis of Video-Language Models Against Visual and Language Perturbations
Madeline Chantry Schiappa
Shruti Vyas
Hamid Palangi
Y. S. Rawat
Vibhav Vineet
VLM
120
17
0
05 Jul 2022
Ego4D: Around the World in 3,000 Hours of Egocentric Video
Kristen Grauman
Andrew Westbury
Eugene Byrne
Zachary Chavis
Antonino Furnari
...
Mike Zheng Shou
Antonio Torralba
Lorenzo Torresani
Mingfei Yan
Jitendra Malik
EgoV
229
1,019
0
13 Oct 2021
TransTrack: Multiple Object Tracking with Transformer
Pei Sun
Jinkun Cao
Yi-Xin Jiang
Rufeng Zhang
Enze Xie
Zehuan Yuan
Changhu Wang
Ping Luo
ViT
VOT
255
566
0
31 Dec 2020
1