2304.08485
Cited By
Visual Instruction Tuning
17 April 2023
Haotian Liu
Chunyuan Li
Qingyang Wu
Yong Jae Lee
SyDa
VLM
MLLM
Papers citing
"Visual Instruction Tuning"
50 / 3,278 papers shown
RoboDesign1M: A Large-scale Dataset for Robot Design Understanding
T. H. Le
T. H. Nguyen
Quang-Dieu Tran
Quang Minh Nguyen
Baoru Huang
Hoan Nguyen
M. Vu
Tung D. Ta
A. Nguyen
3DV
86
0
0
09 Mar 2025
Evaluation of Safety Cognition Capability in Vision-Language Models for Autonomous Driving
Enming Zhang
Peizhe Gong
Xingyuan Dai
Yisheng Lv
Qinghai Miao
MLLM
ELM
70
2
0
09 Mar 2025
PerturboLLaVA: Reducing Multimodal Hallucinations with Perturbative Visual Training
Cong Chen
Mingyu Liu
Chenchen Jing
Y. Zhou
Fengyun Rao
Hao Chen
Bo Zhang
Chunhua Shen
MLLM
AAML
VLM
67
5
0
09 Mar 2025
Does Acceleration Cause Hidden Instability in Vision Language Models? Uncovering Instance-Level Divergence Through a Large-Scale Empirical Study
Yizheng Sun
Hao Li
Chang Xu
Hongpeng Zhou
Chenghua Lin
Riza Batista-Navarro
Jingyuan Sun
65
0
0
09 Mar 2025
Beyond Decoder-only: Large Language Models Can be Good Encoders for Machine Translation
Yingfeng Luo
Tong Zheng
Yongyu Mu
Yangqiu Song
Qinghong Zhang
...
Ziqiang Xu
Peinan Feng
Xiaoqian Liu
Tong Xiao
Jingbo Zhu
AI4CE
272
0
0
09 Mar 2025
Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models
Wenxuan Huang
Bohan Jia
Zijie Zhai
Shaosheng Cao
Zheyu Ye
Fei Zhao
Zhe Xu
Yao Hu
Shaohui Lin
MU
OffRL
LRM
MLLM
ReLM
VLM
66
47
0
09 Mar 2025
DiffCLIP: Differential Attention Meets CLIP
Hasan Hammoud
Guohao Li
VLM
46
0
0
09 Mar 2025
Statistical Study of Sensor Data and Investigation of ML-based Calibration Algorithms for Inexpensive Sensor Modules: Experiments from Cape Point
Travis Barrett
Amit Kumar Mishra
47
1
0
09 Mar 2025
What's in a Latent? Leveraging Diffusion Latent Space for Domain Generalization
Xavier Thomas
Deepti Ghadiyaram
DiffM
92
0
0
09 Mar 2025
Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding
Seil Kang
Jinyeong Kim
Junhyeok Kim
Seong Jae Hwang
VLM
96
2
0
08 Mar 2025
Integrating Frequency-Domain Representations with Low-Rank Adaptation in Vision-Language Models
Md Azim Khan
A. Gangopadhyay
Jianwu Wang
Robert F. Erbacher
VLM
59
0
0
08 Mar 2025
GEM: Empowering MLLM for Grounded ECG Understanding with Time Series and Images
Xiang Lan
Feng Wu
Kai He
Qinghao Zhao
Shenda Hong
Mengling Feng
AI4TS
66
3
0
08 Mar 2025
Can Atomic Step Decomposition Enhance the Self-structured Reasoning of Multimodal Large Models?
Kun Xiang
Zhili Liu
Zihao Jiang
Yunshuang Nie
Kaixin Cai
...
Yu-Jie Yuan
Jiawei Han
Lanqing Hong
Hang Xu
Xiaodan Liang
ReLM
LRM
70
7
0
08 Mar 2025
SplatTalk: 3D VQA with Gaussian Splatting
Anh Thai
Songyou Peng
Kyle Genova
Leonidas J. Guibas
Thomas Funkhouser
3DGS
88
0
0
08 Mar 2025
Advancing Autonomous Vehicle Intelligence: Deep Learning and Multimodal LLM for Traffic Sign Recognition and Robust Lane Detection
Chandan Kumar Sah
Ankit Kumar Shaw
Xiaoli Lian
Arsalan Shahid Baig
Tuopu Wen
Kun Jiang
Mengmeng Yang
Ke Wang
44
1
0
08 Mar 2025
VLForgery Face Triad: Detection, Localization and Attribution via Multimodal Large Language Models
Xinan He
Yue Zhou
Bing Fan
Bin Li
Guopu Zhu
Feng Ding
77
1
0
08 Mar 2025
GenieBlue: Integrating both Linguistic and Multimodal Capabilities for Large Language Models on Mobile Devices
Xudong Lu
Yinghao Chen
Renshou Wu
Haohao Gao
Xi Chen
...
Fangyuan Li
Yafei Wen
Xiaoxin Chen
Shuai Ren
Hongsheng Li
89
0
0
08 Mar 2025
Multi-Layer Visual Feature Fusion in Multimodal LLMs: Methods, Analysis, and Best Practices
Junyan Lin
Haoran Chen
Yue Fan
Yingqi Fan
Xin Jin
Hui Su
Jinlan Fu
Xiaoyu Shen
68
0
0
08 Mar 2025
Treble Counterfactual VLMs: A Causal Approach to Hallucination
Li Li
Jiashu Qu
Yuxiao Zhou
Yuehan Qin
Tiankai Yang
Yue Zhao
98
2
0
08 Mar 2025
CASP: Compression of Large Multimodal Models Based on Attention Sparsity
Mohsen Gholami
Mohammad Akbari
Kevin Cannons
Yong Zhang
65
0
0
07 Mar 2025
Narrating the Video: Boosting Text-Video Retrieval via Comprehensive Utilization of Frame-Level Captions
Chan hur
Jeong-hun Hong
Dong-hun Lee
Dabin Kang
Semin Myeong
Sang-hyo Park
Hyeyoung Park
66
0
0
07 Mar 2025
MagicInfinite: Generating Infinite Talking Videos with Your Words and Voice
Hongwei Yi
Tian Ye
Shitong Shao
Xuancheng Yang
Jiantong Zhao
...
Zeke Xie
Lei Zhu
Wei Li
Michael Lingelbach
Daquan Zhou
VGen
60
1
0
07 Mar 2025
GEMA-Score: Granular Explainable Multi-Agent Score for Radiology Report Evaluation
Zhenxuan Zhang
Kinhei Lee
Weihang Deng
Huichi Zhou
Zihao Jin
Jiahao Huang
Zhifan Gao
D. C. Marshall
Yingying Fang
G. Yang
MedIm
49
1
0
07 Mar 2025
Ensemble Debiasing Across Class and Sample Levels for Fairer Prompting Accuracy
Ruixi Lin
Ziqiao Wang
Yang You
FaML
89
1
0
07 Mar 2025
Pi-GPS: Enhancing Geometry Problem Solving by Unleashing the Power of Diagrammatic Information
Junbo Zhao
Ting Zhang
Jiayu Sun
Mi Tian
Hua Huang
38
1
0
07 Mar 2025
Unified Reward Model for Multimodal Understanding and Generation
Yibin Wang
Yuhang Zang
Hao Li
Cheng Jin
Rongxiang Weng
EGVM
81
5
0
07 Mar 2025
The Challenge of Identifying the Origin of Black-Box Large Language Models
Ziqing Yang
Yixin Wu
Yun Shen
Wei Dai
Michael Backes
Yang Zhang
AAML
47
0
0
06 Mar 2025
Knowledge-Decoupled Synergetic Learning: An MLLM based Collaborative Approach to Few-shot Multimodal Dialogue Intention Recognition
Bin Chen
Yu Zhang
Hongfei Ye
Ziyi Huang
Hongyang Chen
65
1
0
06 Mar 2025
Enhancing SAM with Efficient Prompting and Preference Optimization for Semi-supervised Medical Image Segmentation
Aishik Konwer
Zhijian Yang
Erhan Bas
Cao Xiao
Prateek Prasanna
Parminder Bhatia
Taha A. Kass-Hout
MedIm
VLM
78
0
0
06 Mar 2025
PP-DocBee: Improving Multimodal Document Understanding Through a Bag of Tricks
Feng Ni
Kui Huang
Yao Lu
Wenyu Lv
Guanzhong Wang
Zeyu Chen
Yong-Jin Liu
VLM
60
0
0
06 Mar 2025
DSV-LFS: Unifying LLM-Driven Semantic Cues with Visual Features for Robust Few-Shot Segmentation
Amin Karimi
Charalambos Poullis
VLM
58
0
0
06 Mar 2025
SHAPE : Self-Improved Visual Preference Alignment by Iteratively Generating Holistic Winner
Kejia Chen
Jiawen Zhang
Jiacong Hu
Jiazhen Yang
Jian Lou
Zunlei Feng
Mingli Song
74
0
0
06 Mar 2025
Keeping Yourself is Important in Downstream Tuning Multimodal Large Language Model
Wenke Huang
Jian Liang
Xianda Guo
Yiyang Fang
Guancheng Wan
...
Bin Yang
He Li
Jiawei Shao
Mang Ye
Bo Du
OffRL
LRM
MLLM
KELM
VLM
67
1
0
06 Mar 2025
CLIP is Strong Enough to Fight Back: Test-time Counterattacks towards Zero-shot Adversarial Robustness of CLIP
Songlong Xing
Zhengyu Zhao
N. Sebe
AAML
64
1
0
05 Mar 2025
DoraCycle: Domain-Oriented Adaptation of Unified Generative Model in Multimodal Cycles
Rui Zhao
Weijia Mao
Mike Zheng Shou
71
0
0
05 Mar 2025
See What You Are Told: Visual Attention Sink in Large Multimodal Models
Seil Kang
Jinyeong Kim
Junhyeok Kim
Seong Jae Hwang
VLM
115
5
0
05 Mar 2025
SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Safe Reinforcement Learning
Borong Zhang
Yuhao Zhang
Yalan Qin
Yingshan Lei
Josef Dai
Yuanpei Chen
Yaodong Yang
66
4
0
05 Mar 2025
Task-Agnostic Attacks Against Vision Foundation Models
Brian Pulfer
Yury Belousov
Vitaliy Kinakh
Teddy Furon
S. Voloshynovskiy
AAML
77
0
0
05 Mar 2025
LION-FS: Fast & Slow Video-Language Thinker as Online Video Assistant
Wei Li
Bing Hu
Rui Shao
Leyang Shen
Liqiang Nie
52
2
0
05 Mar 2025
OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction
Huang Huang
Fangchen Liu
Letian Fu
Tingfan Wu
Mustafa Mukadam
Jitendra Malik
Ken Goldberg
Pieter Abbeel
LM&Ro
VLM
90
6
0
05 Mar 2025
Adversarial Training for Multimodal Large Language Models against Jailbreak Attacks
Liming Lu
Shuchao Pang
Siyuan Liang
Haotian Zhu
Xiyu Zeng
Aishan Liu
Yunhuai Liu
Yongbin Zhou
AAML
58
2
0
05 Mar 2025
EgoLife: Towards Egocentric Life Assistant
Jingkang Yang
Shuai Liu
Hongming Guo
Yuhao Dong
X. Zhang
...
Joerg Widmer
Francesco Gringoli
Lei Yang
Bo Li
Ziwei Liu
EgoV
71
2
0
05 Mar 2025
Are Large Vision Language Models Good Game Players?
Xinyu Wang
Bohan Zhuang
Qi Wu
MLLM
ELM
LRM
104
4
0
04 Mar 2025
DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models
Saeed Ranjbar Alvar
Gursimran Singh
Mohammad Akbari
Yong Zhang
VLM
82
0
0
04 Mar 2025
Accelerating Vision-Language-Action Model Integrated with Action Chunking via Parallel Decoding
Wenxuan Song
Jiayi Chen
Pengxiang Ding
Han Zhao
Wei Zhao
Zhide Zhong
Zongyuan Ge
Jun Ma
Haoang Li
57
3
0
04 Mar 2025
Text2Scenario: Text-Driven Scenario Generation for Autonomous Driving Test
Xuan Cai
Xuesong Bai
Zhiyong Cui
Danmu Xie
Daocheng Fu
Haiyang Yu
Yilong Ren
44
0
0
04 Mar 2025
LLM-Safety Evaluations Lack Robustness
Tim Beyer
Sophie Xhonneux
Simon Geisler
Gauthier Gidel
Leo Schwinn
Stephan Günnemann
ALM
ELM
278
0
0
04 Mar 2025
MM-OR: A Large Multimodal Operating Room Dataset for Semantic Understanding of High-Intensity Surgical Environments
Ege Özsoy
Chantal Pellegrini
Tobias Czempiel
Felix Tristram
Kun Yuan
D. Bani-Harouni
U. Eck
Benjamin Busam
Matthias Keicher
Nassir Navab
90
2
0
04 Mar 2025
GRADEO: Towards Human-Like Evaluation for Text-to-Video Generation via Multi-Step Reasoning
Zhun Mou
Bin Xia
Zhengchao Huang
Wenming Yang
Jiaya Jia
VGen
ELM
LRM
73
0
0
04 Mar 2025
Towards Improved Text-Aligned Codebook Learning: Multi-Hierarchical Codebook-Text Alignment with Long Text
Guotao Liang
Baoquan Zhang
Zhiyuan Wen
Junteng Zhao
Yunming Ye
Kola Ye
Yao He
62
0
0
03 Mar 2025