ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2505.20426
  4. Cited By
MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness

MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness

26 May 2025
Yunlong Tang
Pinxin Liu
Mingqian Feng
Zhangyun Tan
Rui Mao
Chao Huang
Jing Bi
Yunzhong Xiao
Susan Liang
Hang Hua
Ali Vosoughi
Luchuan Song
Zeliang Zhang
Chenliang Xu
    LRM
ArXiv (abs)PDFHTML

Papers citing "MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness"

47 / 47 papers shown
Title
Intentional Gesture: Deliver Your Intentions with Gestures for Speech
Intentional Gesture: Deliver Your Intentions with Gestures for Speech
Pinxin Liu
Haiyang Liu
Luchuan Song
Chenliang Xu
SLR
48
1
0
21 May 2025
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
Jinguo Zhu
Weiyun Wang
Zhe Chen
Ziwei Liu
Shenglong Ye
...
Dahua Lin
Yu Qiao
Jifeng Dai
Wenhai Wang
Wei Wang
MLLMVLM
191
130
1
14 Apr 2025
Towards Visual Text Grounding of Multimodal Large Language Model
Towards Visual Text Grounding of Multimodal Large Language Model
Ming Li
Ruiyi Zhang
Jian Chen
Jiuxiang Gu
Yufan Zhou
Franck Dernoncourt
Wanrong Zhu
Dinesh Manocha
Tong Sun
96
3
0
07 Apr 2025
Caption Anything in Video: Fine-grained Object-centric Captioning via Spatiotemporal Multimodal Prompting
Caption Anything in Video: Fine-grained Object-centric Captioning via Spatiotemporal Multimodal Prompting
Yunlong Tang
Jing Bi
Chao Huang
Susan Liang
Daiki Shimada
...
Jinxi He
Liu He
Zeliang Zhang
Jiebo Luo
Chenliang Xu
83
1
0
07 Apr 2025
VERIFY: A Benchmark of Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity
Jing Bi
Junjia Guo
Susan Liang
Guangyu Sun
Luchuan Song
...
Jinxi He
Jiarui Wu
Ali Vosoughi
Chong Chen
Chenliang Xu
LRM
102
8
0
14 Mar 2025
Qwen2.5-VL Technical Report
Qwen2.5-VL Technical Report
S. Bai
Keqin Chen
Xuejing Liu
Jialin Wang
Wenbin Ge
...
Zesen Cheng
Hang Zhang
Zhibo Yang
Haiyang Xu
Junyang Lin
VLM
333
685
0
20 Feb 2025
Unveiling Visual Perception in Language Models: An Attention Head
  Analysis Approach
Unveiling Visual Perception in Language Models: An Attention Head Analysis Approach
Jing Bi
Junjia Guo
Yunlong Tang
Lianggong Wen
Zhang Liu
Chenliang Xu
43
6
0
24 Dec 2024
KinMo: Kinematic-aware Human Motion Understanding and Generation
KinMo: Kinematic-aware Human Motion Understanding and Generation
Pengfei Zhang
Pinxin Liu
Hyeongwoo Kim
Pablo Garrido
Bindita Chaudhuri
151
2
0
23 Nov 2024
VidComposition: Can MLLMs Analyze Compositions in Compiled Videos?
Yunlong Tang
Junjia Guo
Hang Hua
Susan Liang
Mingqian Feng
...
Chao Huang
Jing Bi
Zeliang Zhang
Pooyan Fazli
Chenliang Xu
CoGe
112
11
0
17 Nov 2024
GPT-4o System Card
GPT-4o System Card
OpenAI OpenAI
:
Aaron Hurst
Adam Lerer
Adam P. Goucher
...
Yuchen He
Yuchen Zhang
Yujia Jin
Yunxing Dai
Yury Malkov
MLLM
204
1,019
0
25 Oct 2024
NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples
NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples
Baiqi Li
Zhiqiu Lin
Wenxuan Peng
Jean de Dieu Nyandwi
Daniel Jiang
Zixian Ma
Simran Khanuja
Ranjay Krishna
Graham Neubig
Deva Ramanan
AAMLCoGeVLM
163
31
0
18 Oct 2024
MMCOMPOSITION: Revisiting the Compositionality of Pre-trained
  Vision-Language Models
MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models
Hang Hua
Yunlong Tang
Ziyun Zeng
Liangliang Cao
Zhengyuan Yang
Hangfeng He
Chenliang Xu
Jiebo Luo
VLMCoGe
70
13
0
13 Oct 2024
Language-Guided Joint Audio-Visual Editing via One-Shot Adaptation
Language-Guided Joint Audio-Visual Editing via One-Shot Adaptation
Susan Liang
Chao Huang
Yapeng Tian
Anurag Kumar
Chenliang Xu
DiffM
84
8
0
09 Oct 2024
TextToon: Real-Time Text Toonify Head Avatar from Single Video
TextToon: Real-Time Text Toonify Head Avatar from Single Video
Luchuan Song
Lele Chen
Celong Liu
Pinxin Liu
Chenliang Xu
DiffM
72
9
0
23 Sep 2024
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
Min Shi
Fuxiao Liu
Shihao Wang
Shijia Liao
Subhashree Radhakrishnan
...
Andrew Tao
Andrew Tao
Zhiding Yu
Guilin Liu
Guilin Liu
MLLM
120
68
0
28 Aug 2024
LLaVA-OneVision: Easy Visual Task Transfer
LLaVA-OneVision: Easy Visual Task Transfer
Bo Li
Yuanhan Zhang
Dong Guo
Renrui Zhang
Feng Li
Hao Zhang
Kaichen Zhang
Yanwei Li
Ziwei Liu
Chunyuan Li
MLLMSyDaVLM
117
860
0
06 Aug 2024
MuirBench: A Comprehensive Benchmark for Robust Multi-image
  Understanding
MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding
Fei Wang
Xingyu Fu
James Y. Huang
Zekun Li
Qin Liu
...
Kai-Wei Chang
Dan Roth
Sheng Zhang
Hoifung Poon
Muhao Chen
VLM
103
59
0
13 Jun 2024
LVBench: An Extreme Long Video Understanding Benchmark
LVBench: An Extreme Long Video Understanding Benchmark
Weihan Wang
Zehai He
Wenyi Hong
Yean Cheng
Xiaohan Zhang
...
Shiyu Huang
Bin Xu
Yuxiao Dong
Ming Ding
Jie Tang
ELMVLM
117
91
0
12 Jun 2024
An Empirical Analysis on Large Language Models in Debate Evaluation
An Empirical Analysis on Large Language Models in Debate Evaluation
Xinyi Liu
Pinxin Liu
Hangfeng He
ELM
62
6
0
28 May 2024
DOCCI: Descriptions of Connected and Contrasting Images
DOCCI: Descriptions of Connected and Contrasting Images
Yasumasa Onoe
Sunayana Rane
Zachary Berger
Yonatan Bitton
Jaemin Cho
...
Zarana Parekh
Jordi Pont-Tuset
Garrett Tanzer
Su Wang
Jason Baldridge
81
63
0
30 Apr 2024
FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection
  and Correction
FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction
Hang Hua
Jing Shi
Kushal Kafle
Simon Jenni
Daoan Zhang
John Collomosse
Scott D. Cohen
Jiebo Luo
CoGeVLM
78
12
0
23 Apr 2024
Are We on the Right Way for Evaluating Large Vision-Language Models?
Are We on the Right Way for Evaluating Large Vision-Language Models?
Lin Chen
Jinsong Li
Xiao-wen Dong
Pan Zhang
Yuhang Zang
...
Haodong Duan
Jiaqi Wang
Yu Qiao
Dahua Lin
Feng Zhao
VLM
118
302
0
29 Mar 2024
InternVL: Scaling up Vision Foundation Models and Aligning for Generic
  Visual-Linguistic Tasks
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
Zhe Chen
Jiannan Wu
Wenhai Wang
Weijie Su
Guo Chen
...
Bin Li
Ping Luo
Tong Lu
Yu Qiao
Jifeng Dai
VLMMLLM
254
1,210
0
21 Dec 2023
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
Kunchang Li
Yali Wang
Yinan He
Yizhuo Li
Yi Wang
...
Jilan Xu
Guo Chen
Ping Luo
Limin Wang
Yu Qiao
VLMMLLM
143
503
0
28 Nov 2023
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning
  Benchmark for Expert AGI
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
Xiang Yue
Yuansheng Ni
Kai Zhang
Tianyu Zheng
Ruoqi Liu
...
Yibo Liu
Wenhao Huang
Huan Sun
Yu-Chuan Su
Wenhu Chen
OSLMELMVLM
251
957
0
27 Nov 2023
HallusionBench: An Advanced Diagnostic Suite for Entangled Language
  Hallucination and Visual Illusion in Large Vision-Language Models
HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models
Tianrui Guan
Fuxiao Liu
Xiyang Wu
Ruiqi Xian
Zongxia Li
...
Lichang Chen
Furong Huang
Yaser Yacoob
Dinesh Manocha
Dinesh Manocha
VLMMLLM
105
195
0
23 Oct 2023
Emotional Listener Portrait: Neural Listener Head Generation with
  Emotion
Emotional Listener Portrait: Neural Listener Head Generation with Emotion
Luchuan Song
Guojun Yin
Zhenchao Jin
Xiaoyi Dong
Chenliang Xu
87
12
0
29 Sep 2023
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
Weihao Yu
Zhengyuan Yang
Linjie Li
Jianfeng Wang
Kevin Qinghong Lin
Zicheng Liu
Xinchao Wang
Lijuan Wang
MLLM
107
716
0
04 Aug 2023
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
Bohao Li
Rui Wang
Guangzhi Wang
Yuying Ge
Yixiao Ge
Ying Shan
MLLMELM
119
567
0
30 Jul 2023
MMBench: Is Your Multi-modal Model an All-around Player?
MMBench: Is Your Multi-modal Model an All-around Player?
Yuanzhan Liu
Haodong Duan
Yuanhan Zhang
Yue Liu
Songyang Zhang
...
Jiaqi Wang
Conghui He
Ziwei Liu
Kai-xiang Chen
Dahua Lin
114
1,055
0
12 Jul 2023
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language
  Models
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Chaoyou Fu
Peixian Chen
Yunhang Shen
Yulei Qin
Mengdan Zhang
...
Xiawu Zheng
Ke Li
Xing Sun
Zhenyu Qiu
Rongrong Ji
ELMMLLM
111
856
0
23 Jun 2023
Evaluating Object Hallucination in Large Vision-Language Models
Evaluating Object Hallucination in Large Vision-Language Models
Yifan Li
Yifan Du
Kun Zhou
Jinpeng Wang
Wayne Xin Zhao
Ji-Rong Wen
MLLMLRM
298
809
0
17 May 2023
Caption Anything: Interactive Image Description with Diverse Multimodal
  Controls
Caption Anything: Interactive Image Description with Diverse Multimodal Controls
Teng Wang
Jinrui Zhang
Junjie Fei
Hao Zheng
Yunlong Tang
Zhe Li
Mingqi Gao
Shanshan Zhao
MLLM
156
89
0
04 May 2023
Egocentric Audio-Visual Object Localization
Egocentric Audio-Visual Object Localization
Chao Huang
Yapeng Tian
Anurag Kumar
Chenliang Xu
EgoV
48
35
0
23 Mar 2023
GPT-4 Technical Report
GPT-4 Technical Report
OpenAI OpenAI
OpenAI Josh Achiam
Steven Adler
Sandhini Agarwal
Lama Ahmad
...
Shengjia Zhao
Tianhao Zheng
Juntang Zhuang
William Zhuk
Barret Zoph
LLMAGMLLM
1.5K
14,699
0
15 Mar 2023
AV-NeRF: Learning Neural Fields for Real-World Audio-Visual Scene
  Synthesis
AV-NeRF: Learning Neural Fields for Real-World Audio-Visual Scene Synthesis
Susan Liang
Chao Huang
Yapeng Tian
Anurag Kumar
Chenliang Xu
VGen
79
37
0
04 Feb 2023
GRIT: General Robust Image Task Benchmark
GRIT: General Robust Image Task Benchmark
Tanmay Gupta
Ryan Marten
Aniruddha Kembhavi
Derek Hoiem
VLMOODObjD
58
33
0
28 Apr 2022
Winoground: Probing Vision and Language Models for Visio-Linguistic
  Compositionality
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality
Tristan Thrush
Ryan Jiang
Max Bartolo
Amanpreet Singh
Adina Williams
Douwe Kiela
Candace Ross
CoGe
106
427
0
07 Apr 2022
Deep vanishing point detection: Geometric priors make dataset variations
  vanish
Deep vanishing point detection: Geometric priors make dataset variations vanish
Yancong Lin
R. Wiersma
S. Pintea
Klaus Hildebrandt
E. Eisemann
Jan van Gemert
AAML3DPC
80
24
0
16 Mar 2022
DocVQA: A Dataset for VQA on Document Images
DocVQA: A Dataset for VQA on Document Images
Minesh Mathew
Dimosthenis Karatzas
C. V. Jawahar
144
743
0
01 Jul 2020
Deep Hough Transform for Semantic Line Detection
Deep Hough Transform for Semantic Line Detection
Kai Zhao
Qi Han
Chang-Bin Zhang
Jun Xu
Mingg-Ming Cheng
55
181
0
10 Mar 2020
NeurVPS: Neural Vanishing Point Scanning via Conic Convolution
NeurVPS: Neural Vanishing Point Scanning via Conic Convolution
Yichao Zhou
Haozhi Qi
Jingwei Huang
Yi-An Ma
3DPC
69
44
0
14 Oct 2019
OK-VQA: A Visual Question Answering Benchmark Requiring External
  Knowledge
OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge
Kenneth Marino
Mohammad Rastegari
Ali Farhadi
Roozbeh Mottaghi
117
1,090
0
31 May 2019
CLEVR: A Diagnostic Dataset for Compositional Language and Elementary
  Visual Reasoning
CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning
Justin Johnson
B. Hariharan
Laurens van der Maaten
Li Fei-Fei
C. L. Zitnick
Ross B. Girshick
CoGe
313
2,387
0
20 Dec 2016
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for
  Richer Image-to-Sentence Models
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models
Bryan A. Plummer
Liwei Wang
Christopher M. Cervantes
Juan C. Caicedo
Julia Hockenmaier
Svetlana Lazebnik
205
2,072
0
19 May 2015
VQA: Visual Question Answering
VQA: Visual Question Answering
Aishwarya Agrawal
Jiasen Lu
Stanislaw Antol
Margaret Mitchell
C. L. Zitnick
Dhruv Batra
Devi Parikh
CoGe
217
5,503
0
03 May 2015
Microsoft COCO: Common Objects in Context
Microsoft COCO: Common Objects in Context
Nayeon Lee
Michael Maire
Serge J. Belongie
Lubomir Bourdev
Ross B. Girshick
James Hays
Pietro Perona
Deva Ramanan
C. L. Zitnick
Piotr Dollár
ObjD
424
43,814
0
01 May 2014
1