2304.08485
Visual Instruction Tuning
17 April 2023
Haotian Liu
Chunyuan Li
Qingyang Wu
Yong Jae Lee
SyDa
VLM
MLLM
Papers citing "Visual Instruction Tuning"
50 / 3,253 papers shown
Title
FRAG: Frame Selection Augmented Generation for Long Video and Long Document Understanding
De-An Huang
Subhashree Radhakrishnan
Zhiding Yu
Jan Kautz
VGen
VLM
96
0
0
24 Apr 2025
Can Large Language Models Help Multimodal Language Analysis? MMLA: A Comprehensive Benchmark
Hanlei Zhang
Zhuohang Li
Yeshuang Zhu
Hua Xu
Peiwu Wang
Haige Zhu
Jie Zhou
Jinchao Zhang
43
0
0
23 Apr 2025
Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning
Chris
Yichen Wei
Yi Peng
Xuben Wang
Weijie Qiu
...
Jianhao Zhang
Y. Hao
Xuchen Song
Yang Liu
Yahui Zhou
OffRL
AI4TS
SyDa
LRM
VLM
79
0
0
23 Apr 2025
A Survey of Foundation Model-Powered Recommender Systems: From Feature-Based, Generative to Agentic Paradigms
Chengkai Huang
Hongtao Huang
Tong Yu
Kaige Xie
Junda Wu
Shuai Zhang
Julian McAuley
Dietmar Jannach
Lina Yao
LRM
AI4CE
29
0
0
23 Apr 2025
Streetscape Analysis with Generative AI (SAGAI): Vision-Language Assessment and Mapping of Urban Scenes
Joan Perez
Giovanni Fusco
25
0
0
23 Apr 2025
TrustGeoGen: Scalable and Formal-Verified Data Engine for Trustworthy Multi-modal Geometric Problem Solving
Daocheng Fu
Zijun Chen
Renqiu Xia
Qi Liu
Yuan Feng
...
Peng Gao
Junchi Yan
Botian Shi
Bo Zhang
Yu Qiao
33
1
0
22 Apr 2025
Vidi: Large Multimodal Models for Video Understanding and Editing
Vidi Team
Celong Liu
Chia-Wen Kuo
Dawei Du
Fan Chen
...
Wen Zhong
Xiaohui Shen
Xin Gu
Xing Mei
Xueqiong Qu
67
0
0
22 Apr 2025
Vision-Language Models Are Not Pragmatically Competent in Referring Expression Generation
Ziqiao Ma
Jing Ding
Xuejun Zhang
Dezhi Luo
Jiahe Ding
Sihan Xu
Yuchen Huang
Run Peng
Joyce Chai
56
0
0
22 Apr 2025
AdaViP: Aligning Multi-modal LLMs via Adaptive Vision-enhanced Preference Optimization
Jinda Lu
Jinghan Li
Yuan Gao
Junkang Wu
Jiancan Wu
Xuben Wang
Xiangnan He
177
0
0
22 Apr 2025
AffordanceSAM: Segment Anything Once More in Affordance Grounding
D. Jiang
Mengmeng Wang
Teli Ma
Yiming Li
Yong-Jin Liu
Guang Dai
Lefei Zhang
32
0
0
22 Apr 2025
MR. Video: "MapReduce" is the Principle for Long Video Understanding
Ziqi Pang
Yu-xiong Wang
VLM
45
0
0
22 Apr 2025
FaceInsight: A Multimodal Large Language Model for Face Perception
Jingzhi Li
Changjiang Luo
Ruoyu Chen
Hua Zhang
Wenqi Ren
Jianhou Gan
Xiaochun Cao
CVBM
LRM
65
0
0
22 Apr 2025
IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs
David Ma
Wenjie Qu
J. Ren
Jarvis Guo
Yifan Yao
...
Shiwen Ni
Jing Liu
Wenhao Huang
Ge Zhang
Xiaojie Jin
VLM
42
0
0
21 Apr 2025
AGI Is Coming... Right After AI Learns to Play Wordle
Sarath Shekkizhar
Romain Cosentino
LLMAG
45
0
0
21 Apr 2025
VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models
Weiye Xu
Jun Wang
Weiyun Wang
Zhe Chen
Wengang Zhou
...
Xiaohua Wang
Xizhou Zhu
Wenhai Wang
Jifeng Dai
Jinguo Zhu
VLM
LRM
58
1
0
21 Apr 2025
Multimodal Large Language Models for Enhanced Traffic Safety: A Comprehensive Review and Future Trends
M. Tami
Mohammed Elhenawy
Huthaifa I. Ashqar
46
0
0
21 Apr 2025
Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs
Chun-Hsiao Yeh
Chenyu Wang
Shengbang Tong
Ta-Ying Cheng
Rouyu Wang
Tianzhe Chu
Yuexiang Zhai
Yubei Chen
Shenghua Gao
Yi Ma
LRM
68
1
0
21 Apr 2025
DyFo: A Training-Free Dynamic Focus Visual Search for Enhancing LMMs in Fine-Grained Visual Understanding
Geng Li
Jinglin Xu
Yunzhen Zhao
Yuxin Peng
ObjD
32
0
0
21 Apr 2025
ResNetVLLM-2: Addressing ResNetVLLM's Multi-Modal Hallucinations
Ahmad Khalil
Mahmoud Khalil
A. Ngom
MLLM
VLM
50
0
0
20 Apr 2025
LGD: Leveraging Generative Descriptions for Zero-Shot Referring Image Segmentation
Jiachen Li
Qing Xie
Xiaohan Yu
Hongyun Wang
Jinyu Xu
Yongjian Liu
ObjD
81
0
0
20 Apr 2025
ResNetVLLM -- Multi-modal Vision LLM for the Video Understanding Task
Ahmad Khalil
Mahmoud Khalil
A. Ngom
VLM
50
1
0
20 Apr 2025
Learning from Reasoning Failures via Synthetic Data Generation
Gabriela Ben-Melech Stan
Estelle Aflalo
Avinash Madasu
Vasudev Lal
Phillip Howard
SyDa
LRM
49
0
0
20 Apr 2025
Modality Selection and Skill Segmentation via Cross-Modality Attention
Jiawei Jiang
Kei Ota
Devesh K. Jha
Asako Kanezaki
24
0
0
20 Apr 2025
Manipulating Multimodal Agents via Cross-Modal Prompt Injection
Le Wang
Zonghao Ying
Tianyuan Zhang
Siyuan Liang
Shengshan Hu
Mingchuan Zhang
A. Liu
Xianglong Liu
AAML
33
1
0
19 Apr 2025
How Well Can General Vision-Language Models Learn Medicine By Watching Public Educational Videos?
Rahul Thapa
Andrew Li
Qingyang Wu
B. He
Yuki Sahashi
...
Angela Zhang
Ben Athiwaratkun
Shuaiwen Leon Song
David Ouyang
James Zou
LM&MA
49
0
0
19 Apr 2025
VideoPASTA: 7K Preference Pairs That Matter for Video-LLM Alignment
Yogesh Kulkarni
Pooyan Fazli
38
0
0
18 Apr 2025
Scaling LLaNA: Advancing NeRF-Language Understanding Through Large-Scale Training
Andrea Amaduzzi
Pierluigi Zama Ramirez
Giuseppe Lisanti
Samuele Salti
Luigi Di Stefano
34
0
0
18 Apr 2025
Analysing the Robustness of Vision-Language-Models to Common Corruptions
Muhammad Usama
Syeda Aishah Asim
Syed Bilal Ali
Syed Talal Wasim
Umair Bin Mansoor
VLM
36
0
0
18 Apr 2025
Harmony: A Unified Framework for Modality Incremental Learning
Y. Song
Xiaoshan Yang
D. Jiang
Yaowei Wang
Changsheng Xu
CLL
50
0
0
17 Apr 2025
Low-hallucination Synthetic Captions for Large-Scale Vision-Language Model Pre-training
Xinsong Zhang
Yarong Zeng
Xinting Huang
Hu Hu
Runquan Xie
Han Hu
Zhanhui Kang
MLLM
VLM
55
0
0
17 Apr 2025
NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation
Xiangyan Liu
Jinjie Ni
Zijian Wu
Chao Du
Longxu Dou
Haoran Wang
Tianyu Pang
Michael Shieh
OffRL
LRM
197
1
0
17 Apr 2025
Probing and Inducing Combinational Creativity in Vision-Language Models
Yongqian Peng
Yuxi Ma
Mengmeng Wang
Yuxuan Wang
Yizhou Wang
C. Zhang
Yixin Zhu
Zilong Zheng
MLLM
CoGe
87
0
0
17 Apr 2025
CM3AE: A Unified RGB Frame and Event-Voxel/-Frame Pre-training Framework
Wentao Wu
Xueliang Wang
Chenglong Li
Bo Jiang
Jin Tang
Bin Luo
Qi Liu
34
0
0
17 Apr 2025
VLMGuard-R1: Proactive Safety Alignment for VLMs via Reasoning-Driven Prompt Optimization
Menglan Chen
Xianghe Pang
Jingjing Dong
Wenhao Wang
Yaxin Du
Siheng Chen
LRM
39
0
0
17 Apr 2025
EarthGPT-X: Enabling MLLMs to Flexibly and Comprehensively Understand Multi-Source Remote Sensing Imagery
Wei Zhang
Miaoxin Cai
Yaqian Ning
Tianze Zhang
Yin Zhuang
He Chen
Jun Li
Xuerui Mao
38
0
0
17 Apr 2025
Post-pre-training for Modality Alignment in Vision-Language Foundation Models
Shin'ya Yamaguchi
Dewei Feng
Sekitoshi Kanai
Kazuki Adachi
Daiki Chijiwa
VLM
34
1
0
17 Apr 2025
Perception Encoder: The best visual embeddings are not at the output of the network
Daniel Bolya
Po-Yao (Bernie) Huang
Peize Sun
Jang Hyun Cho
Andrea Madotto
...
Shiyu Dong
Nikhila Ravi
Daniel Li
Piotr Dollár
Christoph Feichtenhofer
ObjD
VOS
103
2
0
17 Apr 2025
VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models
Haojian Huang
Haodong Chen
Shengqiong Wu
Meng Luo
Jinlan Fu
Xinya Du
Hao Zhang
Hao Fei
AI4TS
205
1
0
17 Apr 2025
Self-alignment of Large Video Language Models with Refined Regularized Preference Optimization
Pritam Sarkar
Ali Etemad
38
0
0
16 Apr 2025
Efficient Contrastive Decoding with Probabilistic Hallucination Detection - Mitigating Hallucinations in Large Vision Language Models -
Laura Fieback
Nishilkumar Balar
Jakob Spiegelberg
Hanno Gottschalk
MLLM
VLM
85
0
0
16 Apr 2025
Instruction-augmented Multimodal Alignment for Image-Text and Element Matching
Xinli Yue
Jianhui Sun
Junda Lu
Liangchao Yao
Fan Xia
Tianyi Wang
Fengyun Rao
Jing Lyu
Yuetang Deng
30
0
0
16 Apr 2025
PVUW 2025 Challenge Report: Advances in Pixel-level Understanding of Complex Videos in the Wild
Henghui Ding
Chang Liu
Nikhila Ravi
Shuting He
Y. Wei
...
Haobo Yuan
Xuelong Li
Tao Zhang
Lu Qi
Ming Yang
33
0
0
15 Apr 2025
DeepMLF: Multimodal language model with learnable tokens for deep fusion in sentiment analysis
Efthymios Georgiou
Vassilis Katsouros
Yannis Avrithis
Alexandros Potamianos
28
1
0
15 Apr 2025
GaSLight: Gaussian Splats for Spatially-Varying Lighting in HDR
Christophe Bolduc
Yannick Hold-Geoffroy
Zhixin Shu
Jean-François Lalonde
3DGS
38
0
0
15 Apr 2025
Aligning Generative Denoising with Discriminative Objectives Unleashes Diffusion for Visual Perception
Ziqi Pang
Xin Xu
Yu-Xiong Wang
DiffM
67
0
0
15 Apr 2025
Video Summarization with Large Language Models
Min Jung Lee
Dayoung Gong
Minsu Cho
31
0
0
15 Apr 2025
Socratic Chart: Cooperating Multiple Agents for Robust SVG Chart Understanding
Yuyang Ji
Haohan Wang
LRM
39
0
0
14 Apr 2025
Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding
Tao Zhang
Xuelong Li
Zilong Huang
Y. Li
Weixian Lei
XueQing Deng
Shihao Chen
S. Ji
Jiashi Feng
MLLM
LRM
64
2
0
14 Apr 2025
VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents
Ryota Tanaka
Taichi Iki
Taku Hasegawa
Kyosuke Nishida
Kuniko Saito
Jun Suzuki
VLM
52
2
0
14 Apr 2025
Improving Multimodal Hateful Meme Detection Exploiting LMM-Generated Knowledge
Maria Tzelepi
Vasileios Mezaris
34
0
0
14 Apr 2025