Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2304.08485
Cited By
Visual Instruction Tuning
17 April 2023
Haotian Liu
Chunyuan Li
Qingyang Wu
Yong Jae Lee
SyDa
VLM
MLLM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Visual Instruction Tuning"
50 / 3,278 papers shown
Title
Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas
Shiqi Chen
Tongyao Zhu
Ruochen Zhou
Jinghan Zhang
Siyang Gao
Juan Carlos Niebles
Mor Geva
Junxian He
Jiajun Wu
Manling Li
LRM
60
0
0
03 Mar 2025
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
Abdelrahman Abouelenin
Atabak Ashfaq
Adam Atkinson
Hany Awadalla
Nguyen Bach
...
Ishmam Zabir
Yunan Zhang
Li Zhang
Wenjie Qu
Xiren Zhou
MoE
SyDa
78
32
0
03 Mar 2025
Advancing vision-language models in front-end development via data synthesis
Tong Ge
Yashu Liu
Jieping Ye
Tianyi Li
Chao Wang
78
0
0
03 Mar 2025
A Zero-Shot Learning Approach for Ephemeral Gully Detection from Remote Sensing using Vision Language Models
Seyed Mohamad Ali Tousi
Ramy M. A. Farag
Jacket Demby's
Gbenga Omotara
John A. Lory
Guilherme N. DeSouza
250
0
0
03 Mar 2025
Enhancing Retinal Vessel Segmentation Generalization via Layout-Aware Generative Modelling
Jonathan Fhima
Jan Van Eijgen
Lennert Beeckmans
Thomas Jacobs
Moti Freiman
Luis Filipe Nakayama
Ingeborg Stalmans
Chaim Baskin
Joachim A. Behar
MedIm
74
0
0
03 Mar 2025
MINT: Multi-modal Chain of Thought in Unified Generative Models for Enhanced Image Generation
Yi Wang
Mushui Liu
Wanggui He
Longxiang Zhang
Z. Huang
...
Haoyang Li
Weilong Dai
Mingli Song
Jie Song
Hao Jiang
MLLM
MoE
LRM
93
1
0
03 Mar 2025
WeGen: A Unified Model for Interactive Multimodal Generation as We Chat
Zhipeng Huang
Shaobin Zhuang
Canmiao Fu
Binxin Yang
Ying Zhang
Chong Sun
Zhizheng Zhang
Yali Wang
Chen Li
Zheng-Jun Zha
DiffM
74
2
0
03 Mar 2025
Parameter-free Video Segmentation for Vision and Language Understanding
Louis Mahon
Mirella Lapata
VLM
46
2
0
03 Mar 2025
UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended Language Interface
Hao Tang
Chenwei Xie
Haiyang Wang
Xiaoyi Bao
Tingyu Weng
Pandeng Li
Yun Zheng
Liwei Wang
ObjD
VLM
64
0
0
03 Mar 2025
Watch Out Your Album! On the Inadvertent Privacy Memorization in Multi-Modal Large Language Models
Tianjie Ju
Yi Hua
Hao Fei
Zhenyu Shao
Yubin Zheng
Haodong Zhao
Mong Li Lee
Wynne Hsu
Zhuosheng Zhang
Gongshen Liu
65
0
0
03 Mar 2025
Re-Imagining Multimodal Instruction Tuning: A Representation View
Yiyang Liu
James Liang
Ruixiang Tang
Yugyung Lee
Majid Rabbani
...
Raghuveer M. Rao
Lifu Huang
Dongfang Liu
Qifan Wang
Cheng Han
228
0
0
02 Mar 2025
Quality-Driven Curation of Remote Sensing Vision-Language Data via Learned Scoring Models
Dilxat Muhtar
Enzhuo Zhang
Zhenshi Li
Feng-Xue Gu
Yanglangxing He
Pengfeng Xiao
Xueliang Zhang
58
3
0
02 Mar 2025
CL-MoE: Enhancing Multimodal Large Language Model with Dual Momentum Mixture-of-Experts for Continual Visual Question Answering
Tianyu Huai
Jie Zhou
Xingjiao Wu
Qin Chen
Qingchun Bai
Ze Zhou
Liang He
MoE
45
2
0
01 Mar 2025
Urban Safety Perception Through the Lens of Large Multimodal Models: A Persona-based Approach
Ciro Beneduce
Bruno Lepri
Massimiliano Luca
44
0
0
01 Mar 2025
Solving Instance Detection from an Open-World Perspective
Qianqian Shen
Yunhan Zhao
Nahyun Kwon
Jeeeun Kim
Yanan Li
Shu Kong
48
0
0
01 Mar 2025
AesthetiQ: Enhancing Graphic Layout Design via Aesthetic-Aware Preference Alignment of Multi-modal Large Language Models
Sohan Patnaik
Rishabh Jain
Balaji Krishnamurthy
Mausoom Sarkar
41
0
0
01 Mar 2025
Inst3D-LMM: Instance-Aware 3D Scene Understanding with Multi-modal Instruction Tuning
Hanxun Yu
Wentong Li
Song Wang
Jintai Chen
Jianke Zhu
3DV
LRM
86
3
0
01 Mar 2025
Octopus: Alleviating Hallucination via Dynamic Contrastive Decoding
Wei Suo
Lijun Zhang
Mengyang Sun
Lin Yuanbo Wu
Peng Wang
Yuyao Zhang
MLLM
VLM
52
1
0
01 Mar 2025
RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete
Yuheng Ji
Huajie Tan
Jiayu Shi
Xiaoshuai Hao
Yuan Zhang
...
Huaihai Lyu
Xiaolong Zheng
Jiaming Liu
Zhongyuan Wang
Shanghang Zhang
102
8
0
28 Feb 2025
Towards General Visual-Linguistic Face Forgery Detection(V2)
Ke Sun
Shen Chen
Taiping Yao
Ziyin Zhou
Jiayi Ji
Xiaoshuai Sun
Chia-Wen Lin
Rongrong Ji
69
2
0
28 Feb 2025
MV-MATH: Evaluating Multimodal Math Reasoning in Multi-Visual Contexts
Peijie Wang
Zhong-Zhi Li
Fei Yin
Xin Yang
Dekang Ran
Cheng-Lin Liu
LRM
52
5
0
28 Feb 2025
Adaptive Keyframe Sampling for Long Video Understanding
Xi Tang
Jihao Qiu
Lingxi Xie
Yunjie Tian
Jianbin Jiao
Qixiang Ye
90
0
0
28 Feb 2025
Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think
L. Chen
S. Bai
Wenhao Chai
Weichu Xie
Haozhe Zhao
Leon Vinci
Junyang Lin
Baobao Chang
DiffM
92
4
0
27 Feb 2025
ChatReID: Open-ended Interactive Person Retrieval via Hierarchical Progressive Tuning for Vision Language Models
Ke Niu
Haiyang Yu
Mengyang Zhao
Teng Fu
Siyang Yi
Wei Lu
Bin Li
X. Qian
Xiangyang Xue
88
2
0
27 Feb 2025
Analyzing CLIP's Performance Limitations in Multi-Object Scenarios: A Controlled High-Resolution Study
Reza Abbasi
Ali Nazari
Aminreza Sefid
Mohammadali Banayeeanzade
M. Rohban
M. Baghshah
VLM
64
1
0
27 Feb 2025
Large Language Models as Attribution Regularizers for Efficient Model Training
Davor Vukadin
Marin Šilić
Goran Delač
53
0
0
27 Feb 2025
Knowledge Bridger: Towards Training-free Missing Multi-modality Completion
Guanzhou Ke
Shengfeng He
Xinyu Wang
Bo Wang
Guoqing Chao
Yuyao Zhang
Yi Xie
HeXing Su
73
0
0
27 Feb 2025
MMKE-Bench: A Multimodal Editing Benchmark for Diverse Visual Knowledge
Yuntao Du
Kailin Jiang
Zhi Gao
Chenrui Shi
Zilong Zheng
Siyuan Qi
Qing Li
KELM
79
2
0
27 Feb 2025
Improving Adversarial Transferability in MLLMs via Dynamic Vision-Language Alignment Attack
Chenhe Gu
Jindong Gu
Andong Hua
Yao Qin
AAML
52
0
0
27 Feb 2025
Interpreting CLIP with Hierarchical Sparse Autoencoders
Vladimir Zaigrajew
Hubert Baniecki
P. Biecek
56
0
0
27 Feb 2025
Can Large Language Models Unveil the Mysteries? An Exploration of Their Ability to Unlock Information in Complex Scenarios
Chao Wang
Luning Zhang
Ziyi Wang
Yang Zhou
ELM
VLM
LRM
63
1
0
27 Feb 2025
Towards Statistical Factuality Guarantee for Large Vision-Language Models
Zechao Li
Chao Yan
Nicholas J. Jackson
Wendi Cui
B. Li
Jiaxin Zhang
Bradley Malin
76
0
0
27 Feb 2025
SCA3D: Enhancing Cross-modal 3D Retrieval via 3D Shape and Caption Paired Data Augmentation
Junlong Ren
Hao Wu
Hui Xiong
Haoran Wang
73
0
0
26 Feb 2025
ObjectVLA: End-to-End Open-World Object Manipulation Without Demonstration
Minjie Zhu
Bo Li
Jinming Li
Zhongyi Zhou
Junjie Wen
Xiaoyu Liu
Yaxin Peng
Chaomin Shen
Feifei Feng
LM&Ro
91
4
0
26 Feb 2025
Pathology Report Generation and Multimodal Representation Learning for Cutaneous Melanocytic Lesions
R. Lucassen
Sander P.J. Moonemans
Tijn van de Luijtgaarden
Gerben E. Breimer
W. Blokx
M. Veta
MedIm
69
2
0
26 Feb 2025
QueryAdapter: Rapid Adaptation of Vision-Language Models in Response to Natural Language Queries
N. H. Chapman
Feras Dayoub
Will N. Browne
Christopher F. Lehnert
VLM
82
0
0
26 Feb 2025
Talking to the brain: Using Large Language Models as Proxies to Model Brain Semantic Representation
Xin Liu
Zhe Zhang
Jingxin Nie
72
0
0
26 Feb 2025
On the Importance of Text Preprocessing for Multimodal Representation Learning and Pathology Report Generation
R. Lucassen
Tijn van de Luijtgaarden
Sander P.J. Moonemans
Gerben E. Breimer
W. Blokx
M. Veta
69
0
0
26 Feb 2025
Task-Driven Semantic Quantization and Imitation Learning for Goal-Oriented Communications
Yu-Chieh Chao
Yubei Chen
Weiwei Wang
Achintha Wijesinghe
Suchinthaka Wanninayaka
Songyang Zhang
Zhi Ding
DiffM
81
0
0
25 Feb 2025
VLM-E2E: Enhancing End-to-End Autonomous Driving with Multimodal Driver Attention Fusion
Pei Liu
Haipeng Liu
Haichao Liu
Xin Liu
Jinxin Ni
Jun Ma
76
1
0
25 Feb 2025
Stealthy Backdoor Attack in Self-Supervised Learning Vision Encoders for Large Vision Language Models
Zhaoyi Liu
Huan Zhang
AAML
88
0
0
25 Feb 2025
MMAD: A Comprehensive Benchmark for Multimodal Large Language Models in Industrial Anomaly Detection
Xi Jiang
Jian Li
Hanqiu Deng
Yue Liu
Bin-Bin Gao
Yifeng Zhou
Jialin Li
Chengjie Wang
Feng Zheng
63
2
0
24 Feb 2025
Disentangling Visual Transformers: Patch-level Interpretability for Image Classification
Guillaume Jeanneret
Loïc Simon
F. Jurie
ViT
66
0
0
24 Feb 2025
MOVE: A Mixture-of-Vision-Encoders Approach for Domain-Focused Vision-Language Processing
Matvey Skripkin
Elizaveta Goncharova
Dmitrii Tarasov
Andrey Kuznetsov
69
0
0
24 Feb 2025
Towards Foundation Models for Mixed Integer Linear Programming
Sirui Li
Janardhan Kulkarni
Ishai Menache
Cathy Wu
Beibin Li
62
4
0
24 Feb 2025
Exploring Causes and Mitigation of Hallucinations in Large Vision Language Models
Yaqi Sun
Kyohei Atarashi
Koh Takeuchi
Hisashi Kashima
MLLM
56
0
0
24 Feb 2025
Introducing Visual Perception Token into Multimodal Large Language Model
Runpeng Yu
Xinyin Ma
Xinchao Wang
MLLM
LRM
86
0
0
24 Feb 2025
VaViM and VaVAM: Autonomous Driving through Video Generative Modeling
Florent Bartoccioni
Elias Ramzi
Victor Besnier
Shashanka Venkataramanan
Tuan-Hung Vu
...
Mickael Chen
Éloi Zablocki
Andrei Bursuc
Eduardo Valle
Matthieu Cord
VGen
88
1
0
24 Feb 2025
EigenShield: Causal Subspace Filtering via Random Matrix Theory for Adversarially Robust Vision-Language Models
Nastaran Darabi
Devashri Naik
Sina Tayebati
Dinithi Jayasuriya
Ranganath Krishnan
A. R. Trivedi
AAML
57
0
0
24 Feb 2025
MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs
Jiarui Zhang
Mahyar Khayatkhoei
P. Chhikara
Filip Ilievski
LRM
46
6
0
24 Feb 2025
Previous
1
2
3
...
14
15
16
...
64
65
66
Next