V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs

21 December 2023
Penghao Wu
Saining Xie
    LRM

Papers citing "V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs"

50 / 97 papers shown
OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning
Zhaochen Su
Linjie Li
Mingyang Song
Yunzhuo Hao
Zhengyuan Yang
...
Guanjie Chen
Jiawei Gu
Juntao Li
Xiaoye Qu
Yu Cheng
OffRL
LRM
23
0
0
13 May 2025
DSADF: Thinking Fast and Slow for Decision Making
Alex Zhihao Dou
Dongfei Cui
Jun Yan
W. Wang
Benteng Chen
Haoming Wang
Zeke Xie
Shufei Zhang
OffRL
29
0
0
13 May 2025
SITE: towards Spatial Intelligence Thorough Evaluation
W. Wang
Reuben Tan
Pengyue Zhu
Jianwei Yang
Zhengyuan Yang
Lijuan Wang
Andrey Kolobov
Jianfeng Gao
Boqing Gong
43
0
0
08 May 2025
Grounding Task Assistance with Multimodal Cues from a Single Demonstration
Gabriel Sarch
Balasaravanan Thoravi Kumaravel
Sahithya Ravi
Vibhav Vineet
A. D. Wilson
120
0
0
02 May 2025
Zoomer: Adaptive Image Focus Optimization for Black-box MLLM
Jiaxu Qian
Chendong Wang
Y. Yang
Chaoyun Zhang
Huiqiang Jiang
...
Saravan Rajmohan
Dongmei Zhang
Y. Yang
Qi Zhang
Lili Qiu
VLM
76
0
0
30 Apr 2025
Reinforced MLLM: A Survey on RL-Based Reasoning in Multimodal Large Language Models
Guanghao Zhou
Panjia Qiu
C. L. P. Chen
J. Wang
Zheming Yang
Jian Xu
Minghui Qiu
OffRL
LRM
53
0
0
30 Apr 2025
Unsupervised Visual Chain-of-Thought Reasoning via Preference Optimization
Kesen Zhao
B. Zhu
Qianru Sun
Hanwang Zhang
MLLM
LRM
81
0
0
25 Apr 2025
HRScene: How Far Are VLMs from Effective High-Resolution Image Understanding?
Yusen Zhang
Wenliang Zheng
Aashrith Madasu
Peng Shi
Ryo Kamoi
...
Ranran Haoran Zhang
Avitej Iyer
Renze Lou
Wenpeng Yin
Rui Zhang
63
0
0
25 Apr 2025
V$^2$R-Bench: Holistically Evaluating LVLM Robustness to Fundamental Visual Variations
Zhiyuan Fan
Yumeng Wang
Sandeep Polisetty
Yi Ren Fung
45
0
0
23 Apr 2025
DyFo: A Training-Free Dynamic Focus Visual Search for Enhancing LMMs in Fine-Grained Visual Understanding
Geng Li
Jinglin Xu
Yunzhen Zhao
Yuxin Peng
ObjD
27
0
0
21 Apr 2025
AGI Is Coming... Right After AI Learns to Play Wordle
Sarath Shekkizhar
Romain Cosentino
LLMAG
40
0
0
21 Apr 2025
LongPerceptualThoughts: Distilling System-2 Reasoning for System-1 Perception
Yuan-Hong Liao
Sven Elflein
Liu He
Laura Leal-Taixe
Yejin Choi
Sanja Fidler
David Acuna
ReLM
LRM
VLM
91
0
0
21 Apr 2025
Towards Explainable Fake Image Detection with Multi-Modal Large Language Models
Yikun Ji
Y. Hong
Jiahui Zhan
H. Chen
Jun Lan
Huijia Zhu
Weiqiang Wang
L. Zhang
Jianfu Zhang
MLLM
LRM
46
0
0
19 Apr 2025
Perception in Reflection
Yana Wei
Liang Zhao
Kangheng Lin
En Yu
Yuang Peng
...
Jianjian Sun
Haoran Wei
Zheng Ge
Xiangyu Zhang
Vishal M. Patel
31
0
0
09 Apr 2025
TARAC: Mitigating Hallucination in LVLMs via Temporal Attention Real-time Accumulative Connection
C. Xie
Tongxuan Liu
Lei Jiang
Yuting Zeng
J. Guo
Yunheng Shen
Weizhe Huang
Jing Li
Xiaohua Xu
VLM
56
0
0
05 Apr 2025
ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use
Kaixin Li
Ziyang Meng
Hongzhan Lin
Ziyang Luo
Yuchen Tian
Jing Ma
Zhiyong Huang
Tat-Seng Chua
32
7
0
04 Apr 2025
JailDAM: Jailbreak Detection with Adaptive Memory for Vision-Language Model
Yi Nian
Shenzhe Zhu
Yuehan Qin
Li Li
Z. Wang
Chaowei Xiao
Yue Zhao
21
0
0
03 Apr 2025
TimeSearch: Hierarchical Video Search with Spotlight and Reflection for Human-like Long Video Understanding
Junwen Pan
Rui Zhang
Xin Wan
Yuan Zhang
Ming Lu
Qi She
VLM
36
1
0
02 Apr 2025
Online Reasoning Video Segmentation with Just-in-Time Digital Twins
Yiqing Shen
Bohan Liu
Chenjia Li
Lalithkumar Seenivasan
Mathias Unberath
VOS
75
2
0
27 Mar 2025
Operating Room Workflow Analysis via Reasoning Segmentation over Digital Twins
Yiqing Shen
Chenjia Li
Bohan Liu
Cheng-Yi Li
Tito Porras
Mathias Unberath
54
2
0
26 Mar 2025
Scaling Vision Pre-Training to 4K Resolution
Baifeng Shi
Boyi Li
Han Cai
Y. Lu
Sifei Liu
...
Jan Kautz
Song Han
Trevor Darrell
Pavlo Molchanov
Hongxu Yin
CLIP
104
0
0
25 Mar 2025
FALCONEye: Finding Answers and Localizing Content in ONE-hour-long videos with multi-modal LLMs
Carlos Plou
Cesar Borja
Ruben Martinez-Cantin
Ana C. Murillo
56
0
0
25 Mar 2025
LLaVAction: evaluating and training multi-modal large language models for action recognition
Shaokai Ye
Haozhe Qi
Alexander Mathis
Mackenzie W. Mathis
60
1
0
24 Mar 2025
UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction
Shravan Nayak
Xiangru Jian
Kevin Qinghong Lin
Juan A. Rodriguez
Montek Kalsi
...
David Vazquez
Christopher Pal
Perouz Taslakian
Spandana Gella
Sai Rajeswar
151
0
0
19 Mar 2025
Logic-in-Frames: Dynamic Keyframe Search via Visual Semantic-Logical Verification for Long Video Understanding
Weiyu Guo
Ziyang Chen
Shaoguang Wang
JianXiang He
Yijie Xu
Jinhui Ye
Ying Sun
Hui Xiong
44
1
0
17 Mar 2025
VERIFY: A Benchmark of Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity
Jing Bi
Junjia Guo
Susan Liang
Guangyu Sun
Luchuan Song
...
Jinxi He
Jiarui Wu
A. Vosoughi
C. L. P. Chen
Chenliang Xu
LRM
69
1
0
14 Mar 2025
Semantic-Clipping: Efficient Vision-Language Modeling with Semantic-Guidedd Visual Selection
Bangzheng Li
Fei-Yue Wang
Wenxuan Zhou
Nan Xu
Ben Zhou
Sheng Zhang
Hoifung Poon
M. Chen
MLLM
VLM
84
0
0
14 Mar 2025
Measure Twice, Cut Once: Grasping Video Structures and Event Semantics with LLMs for Video Temporal Localization
Zongshang Pang
Mayu Otani
Yuta Nakashima
51
0
0
12 Mar 2025
When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning
Junwei Luo
Yingying Zhang
X. J. Yang
Kang Wu
Qi Zhu
Lei Liang
Jingdong Chen
Yansheng Li
62
0
0
10 Mar 2025
VisRL: Intention-Driven Visual Perception via Reinforced Reasoning
Zhangquan Chen
Xufang Luo
Dongsheng Li
OffRL
LRM
64
3
0
10 Mar 2025
VisualSimpleQA: A Benchmark for Decoupled Evaluation of Large Vision-Language Models in Fact-Seeking Question Answering
Yanling Wang
Yihan Zhao
Xiaodong Chen
Shasha Guo
Lixin Liu
Haoyang Li
Yong Xiao
J. Zhang
Qi Li
Ke Xu
42
1
0
09 Mar 2025
Out-of-Distribution Radar Detection in Compound Clutter and Thermal Noise through Variational Autoencoders
Y A Rouzoumka
E Terreaux
C. Morisseau
J. Ovarlez
C. Ren
46
2
0
06 Mar 2025
Vision-Language Models Struggle to Align Entities across Modalities
Iñigo Alonso
Ander Salaberria
Gorka Azkune
Jeremy Barnes
Oier López de Lacalle
VLM
56
0
0
05 Mar 2025
Exploring and Evaluating Multimodal Knowledge Reasoning Consistency of Multimodal Large Language Models
Boyu Jia
Junzhe Zhang
Huixuan Zhang
Xiaojun Wan
LRM
44
1
0
03 Mar 2025
MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs
Jiarui Zhang
Mahyar Khayatkhoei
P. Chhikara
Filip Ilievski
LRM
39
6
0
24 Feb 2025
Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for Multimodal Reasoning Models
Qianqi Yan
Yue Fan
Hongquan Li
Shan Jiang
Yang Zhao
Xinze Guan
Ching-Chen Kuo
X. Wang
VLM
LRM
60
2
0
22 Feb 2025
NAVIG: Natural Language-guided Analysis with Vision Language Models for Image Geo-localization
Zheyuan Zhang
Runze Li
Tasnim Kabir
Jordan Boyd-Graber
46
0
0
21 Feb 2025
Investigating Inference-time Scaling for Chain of Multi-modal Thought: A Preliminary Study
Yujie Lin
Ante Wang
Moye Chen
Jingyao Liu
Hao Liu
Jinsong Su
Xinyan Xiao
LRM
48
2
0
17 Feb 2025
CORDIAL: Can Multimodal Large Language Models Effectively Understand Coherence Relationships?
Aashish Anantha Ramakrishnan
Aadarsh Anantha Ramakrishnan
Dongwon Lee
47
1
0
16 Feb 2025
Enhanced Multimodal RAG-LLM for Accurate Visual Question Answering
Junxiao Xue
Quan Deng
Fei Yu
Yanhao Wang
Jun Wang
Y. Li
VLM
41
3
0
31 Dec 2024
A Review of Multimodal Explainable Artificial Intelligence: Past, Present and Future
Shilin Sun
Wenbin An
Feng Tian
Fang Nan
Qidong Liu
J. Liu
N. Shah
Ping Chen
83
2
0
18 Dec 2024
GeoX: Geometric Problem Solving Through Unified Formalized Vision-Language Pre-training
Renqiu Xia
M. Li
Hancheng Ye
Wenjie Wu
Hongbin Zhou
...
Conghui He
Botian Shi
Tao Chen
Junchi Yan
Bo Zhang
82
7
0
16 Dec 2024
Enhancing Visual Reasoning with Autonomous Imagination in Multimodal Large Language Models
J. Liu
Yumeng Li
Boyuan Xiao
Yichang Jian
Ziang Qin
Tianjia Shao
Yao-Xiang Ding
Kun Zhou
MLLM
LRM
95
3
0
27 Nov 2024
VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models
Lei Li
Y. X. Wei
Zhihui Xie
Xuqing Yang
Yifan Song
...
Tianyu Liu
Sujian Li
Bill Yuchen Lin
Lingpeng Kong
Q. Liu
CoGe
VLM
115
24
0
26 Nov 2024
ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration
Haozhan Shen
Kangjia Zhao
Tiancheng Zhao
Ruochen Xu
Zilun Zhang
Mingwei Zhu
Jianwei Yin
87
4
0
25 Nov 2024
VaLiD: Mitigating the Hallucination of Large Vision Language Models by Visual Layer Fusion Contrastive Decoding
Jiaqi Wang
Yifei Gao
Jitao Sang
MLLM
113
2
0
24 Nov 2024
ORID: Organ-Regional Information Driven Framework for Radiology Report Generation
Tiancheng Gu
Kaicheng Yang
Xiang An
Ziyong Feng
Dongnan Liu
Weidong Cai
69
1
0
20 Nov 2024
YOLO-RD: Introducing Relevant and Compact Explicit Knowledge to YOLO by Retriever-Dictionary
Hao-Tang Tsui
Chien-Yao Wang
H. Liao
ObjD
VLM
46
0
0
20 Oct 2024
LLaVA-Ultra: Large Chinese Language and Vision Assistant for Ultrasound
Xuechen Guo
Wenhao Chai
Shi-Yan Li
Gaoang Wang
31
6
0
19 Oct 2024
ActiView: Evaluating Active Perception Ability for Multimodal Large Language Models
Ziyue Wang
Chi Chen
Fuwen Luo
Yurui Dong
Yuanchi Zhang
Yuzhuang Xu
Xiaolong Wang
Peng Li
Yang Liu
LRM
35
3
0
07 Oct 2024