2304.08485
Cited By
Visual Instruction Tuning
17 April 2023
Haotian Liu
Chunyuan Li
Qingyang Wu
Yong Jae Lee
SyDa
VLM
MLLM
Papers citing
"Visual Instruction Tuning"
50 / 3,278 papers shown
Title
DAVE: Diagnostic benchmark for Audio Visual Evaluation
Gorjan Radevski
Teodora Popordanoska
Matthew B. Blaschko
Tinne Tuytelaars
63
0
0
12 Mar 2025
BIMBA: Selective-Scan Compression for Long-Range Video Question Answering
Md. Mohaiminul Islam
Tushar Nagarajan
Huiyu Wang
Gedas Bertasius
Lorenzo Torresani
255
1
0
12 Mar 2025
VideoScan: Enabling Efficient Streaming Video Understanding via Frame-level Semantic Carriers
Ruanjun Li
Yuedong Tan
Yuanming Shi
Jiawei Shao
VLM
213
0
0
12 Mar 2025
Training Data Provenance Verification: Did Your Model Use Synthetic Data from My Generative Model for Training?
Yuechen Xie
Jie Song
Huiqiong Wang
Mingli Song
57
0
0
12 Mar 2025
Multi-Cue Adaptive Visual Token Pruning for Large Vision-Language Models
Bozhi Luan
Wengang Zhou
Hao Feng
Zhe Wang
Xiaosong Li
Haoyang Li
VLM
70
0
0
11 Mar 2025
Structural and Statistical Texture Knowledge Distillation and Learning for Segmentation
Deyi Ji
Feng Zhao
Hongtao Lu
Feng Wu
Jieping Ye
76
2
0
11 Mar 2025
Trinity: A Modular Humanoid Robot AI System
Jingkai Sun
Qiang Zhang
Gang Han
Wen Zhao
Zhe Yong
Yan He
Jiaxu Wang
Jiahang Cao
Yijie Guo
Renjing Xu
47
0
0
11 Mar 2025
Seeing and Reasoning with Confidence: Supercharging Multimodal LLMs with an Uncertainty-Aware Agentic Framework
Zhuo Zhi
Chen Feng
Adam Daneshmend
Mine Orlu
Andreas Demosthenous
L. Yin
Da Li
Ziquan Liu
Miguel R. D. Rodrigues
LRM
77
1
0
11 Mar 2025
EMMOE: A Comprehensive Benchmark for Embodied Mobile Manipulation in Open Environments
Dongping Li
Tielong Cai
Tianci Tang
Wenhao Chai
Katherine Rose Driggs-Campbell
Gaoang Wang
LM&Ro
71
0
0
11 Mar 2025
DiffEGG: Diffusion-Driven Edge Generation as a Pixel-Annotation-Free Alternative for Instance Annotation
Sanghyun Jo
Ziseok Lee
Wooyeol Lee
Kyungsu Kim
52
0
0
11 Mar 2025
EgoBlind: Towards Egocentric Visual Assistance for the Blind People
Junbin Xiao
Nanxin Huang
Hao Qiu
Zhulin Tao
Xun Yang
Richang Hong
Ming Wang
Angela Yao
EgoV
VLM
73
0
0
11 Mar 2025
Oasis: One Image is All You Need for Multimodal Instruction Data Synthesis
Letian Zhang
Quan Cui
Bingchen Zhao
Cheng Yang
MLLM
SyDa
59
1
0
11 Mar 2025
PhysVLM: Enabling Visual Language Models to Understand Robotic Physical Reachability
Weijie Zhou
Manli Tao
Chaoyang Zhao
Haiyun Guo
Honghui Dong
Ming Tang
Jinqiao Wang
51
1
0
11 Mar 2025
Learning to Match Unpaired Data with Minimum Entropy Coupling
Mustapha Bounoua
Giulio Franzese
Pietro Michiardi
44
0
0
11 Mar 2025
Prompt2LVideos: Exploring Prompts for Understanding Long-Form Multimodal Videos
Soumya Jahagirdar
Jayasree Saha
C. V. Jawahar
61
0
0
11 Mar 2025
LongProLIP: A Probabilistic Vision-Language Model with Long Context Text
Sanghyuk Chun
Sangdoo Yun
VLM
56
1
0
11 Mar 2025
Referring to Any Person
Qing Jiang
Lin Wu
Zhaoyang Zeng
Tianhe Ren
Yuda Xiong
Yihao Chen
Qin Liu
Lei Zhang
258
0
0
11 Mar 2025
Aligning Text to Image in Diffusion Models is Easier Than You Think
J. Lee
Byunghee Cha
Jeongsol Kim
Jong Chul Ye
56
0
0
11 Mar 2025
LangTime: A Language-Guided Unified Model for Time Series Forecasting with Proximal Policy Optimization
Wenzhe Niu
Zongxia Xie
Yanru Sun
Wei He
Man Xu
Chao Hao
AI4TS
55
1
0
11 Mar 2025
OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models
Jialv Zou
Bencheng Liao
Qian Zhang
Wenyu Liu
Xinggang Wang
Mamba
MLLM
82
1
0
11 Mar 2025
Attention Hijackers: Detect and Disentangle Attention Hijacking in LVLMs for Hallucination Mitigation
Beitao Chen
Xinyu Lyu
Lianli Gao
Jingkuan Song
H. Shen
75
1
0
11 Mar 2025
MoRE: Unlocking Scalability in Reinforcement Learning for Quadruped Vision-Language-Action Models
Han Zhao
Wenxuan Song
Donglin Wang
Xinyang Tong
Pengxiang Ding
Xuelian Cheng
Zongyuan Ge
68
2
0
11 Mar 2025
AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning
Bo Jiang
Shaoyu Chen
Qian Zhang
Wenyu Liu
Xinggang Wang
OffRL
LRM
VLM
84
5
0
10 Mar 2025
Filter Images First, Generate Instructions Later: Pre-Instruction Data Selection for Visual Instruction Tuning
Bardia Safaei
Faizan Siddiqui
Jiacong Xu
Vishal M. Patel
Shao-Yuan Lo
VLM
259
0
0
10 Mar 2025
Customized SAM 2 for Referring Remote Sensing Image Segmentation
Fu Rong
Meng Lan
Qian Zhang
Lefei Zhang
52
0
0
10 Mar 2025
Unleashing the Potential of Large Language Models for Text-to-Image Generation through Autoregressive Representation Alignment
Xing Xie
Jiawei Liu
Ziyue Lin
Huijie Fan
Zhi Han
Yandong Tang
Liangqiong Qu
47
0
0
10 Mar 2025
A Multimodal Benchmark Dataset and Model for Crop Disease Diagnosis
Xiang Liu
Zhaoxiang Liu
Huan Hu
Zezhou Chen
Kohou Wang
Ning Wang
Kai Wang
43
1
0
10 Mar 2025
Think Before You Segment: High-Quality Reasoning Segmentation with GPT Chain of Thoughts
Shiu-hong Kao
Yu-Wing Tai
Chi-Keung Tang
LRM
MLLM
66
0
0
10 Mar 2025
FedRand: Enhancing Privacy in Federated Learning with Randomized LoRA Subparameter Updates
Sangwoo Park
Seanie Lee
Byungjoo Kim
Sung Ju Hwang
FedML
52
0
0
10 Mar 2025
Boosting Diffusion-Based Text Image Super-Resolution Model Towards Generalized Real-World Scenarios
Chenglu Pan
Xiaogang Xu
Ganggui Ding
Yunke Zhang
Wenbo Li
Jiarong Xu
Qingbiao Wu
60
0
0
10 Mar 2025
When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning
Junwei Luo
Yingying Zhang
Xiaoyu Yang
Kang Wu
Qi Zhu
Lei Liang
Jingdong Chen
Yansheng Li
72
1
0
10 Mar 2025
Keeping Representation Similarity in Finetuning for Medical Image Analysis
Wenqiang Zu
Shenghao Xie
Hao Chen
Yiming Liang
Lei Ma
MedIm
OOD
53
0
0
10 Mar 2025
Should VLMs be Pre-trained with Image Data?
Sedrick Scott Keh
Jean Mercat
S. Gadre
Kushal Arora
Igor Vasiljevic
...
Shuran Song
Russ Tedrake
Thomas Kollar
Ludwig Schmidt
Achal Dave
VLM
49
0
0
10 Mar 2025
LLaFEA: Frame-Event Complementary Fusion for Fine-Grained Spatiotemporal Understanding in LMMs
Hanyu Zhou
Gim Hee Lee
47
0
0
10 Mar 2025
DistiLLM-2: A Contrastive Approach Boosts the Distillation of LLMs
Jongwoo Ko
Tianyi Chen
Sungnyun Kim
Tianyu Ding
Luming Liang
Ilya Zharkov
Se-Young Yun
VLM
246
0
0
10 Mar 2025
Painting with Words: Elevating Detailed Image Captioning with Benchmark and Alignment Learning
Qinghao Ye
Xianhan Zeng
Fu Li
Chong Li
Haoqi Fan
CoGe
88
2
0
10 Mar 2025
LLaVA-RadZ: Can Multimodal Large Language Models Effectively Tackle Zero-shot Radiology Recognition?
Bangyan Li
Wenxuan Huang
Yunhang Shen
Yansen Wang
Shaohui Lin
...
Ling You
Yinqi Zhang
Ke Li
Xing Sun
Yan Sun
61
2
0
10 Mar 2025
V2Flow: Unifying Visual Tokenization and Large Language Model Vocabularies for Autoregressive Image Generation
Guiwei Zhang
Tianyu Zhang
Mohan Zhou
Yalong Bai
Biye Li
69
0
0
10 Mar 2025
AutoSpatial: Visual-Language Reasoning for Social Robot Navigation through Efficient Spatial Reasoning Learning
Yangzhe Kong
Daeun Song
Jing Liang
Dinesh Manocha
Ziyu Yao
Xuesu Xiao
LRM
60
1
0
10 Mar 2025
EAZY: Eliminating Hallucinations in LVLMs by Zeroing out Hallucinatory Image Tokens
Liwei Che
Tony Qingze Liu
Jing Jia
Weiyi Qin
Ruixiang Tang
Vladimir Pavlovic
MLLM
VLM
110
1
0
10 Mar 2025
LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL
Yingzhe Peng
Gongrui Zhang
Miaosen Zhang
Zhiyuan You
Jie Liu
Qipeng Zhu
Kai Yang
Xingzhong Xu
Xin Geng
Xu Yang
LRM
ReLM
100
33
0
10 Mar 2025
Boosting the Generalization and Reasoning of Vision Language Models with Curriculum Reinforcement Learning
Huilin Deng
Ding Zou
Rui Ma
Hongchen Luo
Yang Cao
Yu Kang
LRM
VLM
62
6
0
10 Mar 2025
Chameleon: Fast-slow Neuro-symbolic Lane Topology Extraction
Zongzheng Zhang
Xinrun Li
Sizhe Zou
Guoxuan Chi
Siqi Li
...
Guoliang Wang
Guantian Zheng
Leichen Wang
Hang Zhao
Hao Zhao
67
0
0
10 Mar 2025
Is CLIP ideal? No. Can we fix it? Yes!
Raphi Kang
Yue Song
Georgia Gkioxari
Pietro Perona
VLM
66
0
0
10 Mar 2025
Taking Notes Brings Focus? Towards Multi-Turn Multimodal Dialogue Learning
Jiazheng Liu
Sipeng Zheng
Börje F. Karlsson
Zongqing Lu
34
0
0
10 Mar 2025
PointVLA: Injecting the 3D World into Vision-Language-Action Models
Chengmeng Li
Junjie Wen
Yan Peng
Chaomin Shen
Feifei Feng
Bo Li
3DPC
73
4
0
10 Mar 2025
REF-VLM: Triplet-Based Referring Paradigm for Unified Visual Decoding
Yan Tai
Luhao Zhu
Zhiqiang Chen
Ynan Ding
Yiying Dong
Xiaohong Liu
Guodong Guo
MLLM
ObjD
57
0
0
10 Mar 2025
Less is More: Adaptive Program Repair with Bug Localization and Preference Learning
Zhenlong Dai
Bingrui Chen
Zhuoluo Zhao
Xiu Tang
Sai Wu
Chang Yao
Zhipeng Gao
Jingyuan Chen
KELM
59
2
0
09 Mar 2025
SemHiTok: A Unified Image Tokenizer via Semantic-Guided Hierarchical Codebook for Multimodal Understanding and Generation
Zhenpeng Chen
Chunwei Wang
Xiuwei Chen
Hang Xu
Jiawei Han
Xiandan Liang
VLM
76
1
0
09 Mar 2025
Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement
Yuqi Liu
Bohao Peng
Zhisheng Zhong
Zihao Yue
Fanbin Lu
Bei Yu
Jiaya Jia
LRM
VLM
60
13
0
09 Mar 2025