2304.08485
Visual Instruction Tuning
17 April 2023
Haotian Liu
Chunyuan Li
Qingyang Wu
Yong Jae Lee
SyDa
VLM
MLLM
Papers citing
"Visual Instruction Tuning"
50 / 3,278 papers shown
Title
PHT-CAD: Efficient CAD Parametric Primitive Analysis with Progressive Hierarchical Tuning
Ke Niu
Yuwen Chen
Haiyang Yu
Z. Chen
Xianghui Que
Bin Li
Xiangyang Xue
60
0
0
23 Mar 2025
GOAL: Global-local Object Alignment Learning
Hyungyu Choi
Young Kyun Jang
Chanho Eom
VLM
222
0
0
22 Mar 2025
CODA: Repurposing Continuous VAEs for Discrete Tokenization
Zeyu Liu
Zanlin Ni
Yeguo Hua
Xin Deng
Xiao Ma
Cheng Zhong
Gao Huang
52
0
0
22 Mar 2025
Towards Transformer-Based Aligned Generation with Self-Coherence Guidance
Shulei Wang
Wang Lin
Hai Huang
Hanting Wang
Sihang Cai
...
Tao Jin
Jingyuan Chen
Jiacheng Sun
Jieming Zhu
Zhou Zhao
DiffM
68
2
0
22 Mar 2025
Safe RLHF-V: Safe Reinforcement Learning from Multi-modal Human Feedback
Yalan Qin
Xiuying Chen
Rui Pan
Han Zhu
Chen Zhang
...
Chi-Min Chan
Sirui Han
Yike Guo
Yiran Yang
Yaodong Yang
OffRL
82
4
0
22 Mar 2025
4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding
Wenxuan Zhu
Bing Li
Cheng Zheng
Jinjie Mai
Jun-Cheng Chen
...
Abdullah Hamdi
Sara Rojas Martinez
Chia-Wen Lin
Mohamed Elhoseiny
Bernard Ghanem
VLM
53
0
0
22 Mar 2025
good4cir: Generating Detailed Synthetic Captions for Composed Image Retrieval
Pranavi Kolouju
Eric Xing
Robert Pless
Nathan Jacobs
Abby Stylianou
3DV
58
0
0
22 Mar 2025
LoRASculpt: Sculpting LoRA for Harmonizing General and Specialized Knowledge in Multimodal Large Language Models
Jian Liang
Wenke Huang
Guancheng Wan
Qu Yang
Mang Ye
MoMe
CLL
AI4CE
62
2
0
21 Mar 2025
ModalTune: Fine-Tuning Slide-Level Foundation Models with Multi-Modal Information for Multi-task Learning in Digital Pathology
Vishwesh Ramanathan
Tony Xu
Pushpak Pati
Faruk Ahmed
Maged Goubran
Anne L. Martel
50
0
0
21 Mar 2025
PVChat: Personalized Video Chat with One-Shot Learning
Yufei Shi
Weilong Yan
Gang Xu
Yumeng Li
Yong Li
Zechao Li
Fei Richard Yu
Ming Li
Si Yong Yeo
45
0
0
21 Mar 2025
MedAgent-Pro: Towards Evidence-based Multi-modal Medical Diagnosis via Reasoning Agentic Workflow
Ziyue Wang
Junde Wu
Linghan Cai
Chang Han Low
Xihong Yang
Qiaxuan Li
Yueming Jin
LRM
70
2
0
21 Mar 2025
Beyond Semantics: Rediscovering Spatial Awareness in Vision-Language Models
Jianing Qi
Jiawei Liu
Hao Tang
Zhigang Zhu
114
1
0
21 Mar 2025
Position: Interactive Generative Video as Next-Generation Game Engine
Jiwen Yu
Yiran Qin
Haoxuan Che
Quande Liu
Xintao Wang
Pengfei Wan
Di Zhang
Xihui Liu
VGen
50
1
0
21 Mar 2025
TruthLens: Explainable DeepFake Detection for Face Manipulated and Fully Synthetic Data
Rohit Kundu
Athula Balachandran
Amit K. Roy-Chowdhury
50
0
0
20 Mar 2025
GraspCoT: Integrating Physical Property Reasoning for 6-DoF Grasping under Flexible Language Instructions
Xiaomeng Chu
Jiajun Deng
Guoliang You
Wei Liu
Xuran Li
Jianmin Ji
Wenjie Qu
84
0
0
20 Mar 2025
MASH-VLM: Mitigating Action-Scene Hallucination in Video-LLMs through Disentangled Spatial-Temporal Representations
Kyungho Bae
Jinhyung Kim
Sihaeng Lee
Soonyoung Lee
G. Lee
Jinwoo Choi
66
1
0
20 Mar 2025
Video-VoT-R1: An efficient video inference model integrating image packing and AoE architecture
Cheng Li
Jiexiong Liu
Yixuan Chen
Yanqin Jia
MLLM
VLM
76
0
0
20 Mar 2025
VerbDiff: Text-Only Diffusion Models with Enhanced Interaction Awareness
SeungJu Cha
Kwanyoung Lee
Ye-Chan Kim
Hyunwoo Oh
Dong-Jin Kim
48
0
0
20 Mar 2025
Enhancing Zero-Shot Image Recognition in Vision-Language Models through Human-like Concept Guidance
Hui Liu
Wenya Wang
Kecheng Chen
Jie Liu
Yibing Liu
Tiexin Qin
Peisong He
Xinghao Jiang
Haoliang Li
BDL
VLM
257
0
0
20 Mar 2025
BadToken: Token-level Backdoor Attacks to Multi-modal Large Language Models
Zenghui Yuan
Jiawen Shi
Pan Zhou
Neil Zhenqiang Gong
Lichao Sun
AAML
70
1
0
20 Mar 2025
UMIT: Unifying Medical Imaging Tasks via Vision-Language Models
Haiyang Yu
Siyang Yi
Ke Niu
Minghan Zhuo
Bin Li
LM&MA
55
0
0
20 Mar 2025
A Vision Centric Remote Sensing Benchmark
Abduljaleel Adejumo
Faegheh Yeganli
Clifford Broni-bediako
Aoran Xiao
Naoto Yokoya
Mennatullah Siam
67
0
0
20 Mar 2025
IRef-VLA: A Benchmark for Interactive Referential Grounding with Imperfect Language in 3D Scenes
Haochen Zhang
Nader Zantout
Pujith Kachana
Ji Zhang
Wenshan Wang
VGen
56
0
0
20 Mar 2025
EDEN: Enhanced Diffusion for High-quality Large-motion Video Frame Interpolation
Zihao Zhang
Haoran Chen
Haoyu Zhao
Guansong Lu
Yanwei Fu
Hang Xu
Zuxuan Wu
VGen
DiffM
79
1
0
20 Mar 2025
Intelligent Spatial Perception by Building Hierarchical 3D Scene Graphs for Indoor Scenarios with the Help of LLMs
Yao Cheng
Zhe Han
Fengyang Jiang
Huaizhen Wang
Fengyu Zhou
Qingshan Yin
Lei Wei
3DV
49
1
0
19 Mar 2025
VisNumBench: Evaluating Number Sense of Multimodal Large Language Models
Tengjin Weng
Jingyi Wang
Wenhao Jiang
Zhong Ming
VLM
LRM
54
0
0
19 Mar 2025
EgoDTM: Towards 3D-Aware Egocentric Video-Language Pretraining
Boshen Xu
Yuting Mei
Xinbi Liu
Sipeng Zheng
Qin Jin
VLM
MDE
73
0
0
19 Mar 2025
MathFlow: Enhancing the Perceptual Flow of MLLMs for Visual Mathematical Problems
Felix Chen
Hangjie Yuan
Yunqiu Xu
Tao Feng
Jun Cen
Pengwei Liu
Zeying Huang
Yi Yang
LRM
50
1
0
19 Mar 2025
Vision-Speech Models: Teaching Speech Models to Converse about Images
Amélie Royer
Moritz Böhle
Gabriel de Marmiesse
Laurent Mazaré
Neil Zeghidour
Alexandre Défossez
P. Pérez
AuLLM
VLM
86
0
0
19 Mar 2025
UPME: An Unsupervised Peer Review Framework for Multimodal Large Language Model Evaluation
Qihui Zhang
Munan Ning
Zheyuan Liu
Yanbo Wang
Jiayi Ye
Yue Huang
Shuo Yang
Xiao Chen
Y. Song
Li Yuan
LRM
65
0
0
19 Mar 2025
TULIP: Towards Unified Language-Image Pretraining
Zineng Tang
Long Lian
Seun Eisape
Xudong Wang
Roei Herzig
Adam Yala
Alane Suhr
Trevor Darrell
David M. Chan
VLM
CLIP
MLLM
103
3
0
19 Mar 2025
Mitigating Object Hallucinations in MLLMs via Multi-Frequency Perturbations
Shuo Li
Jiajun Sun
Guodong Zheng
Xiaoran Fan
Yujiong Shen
...
Wenming Tan
Tao Ji
Tao Gui
Qi Zhang
Xuanjing Huang
AAML
VLM
92
1
0
19 Mar 2025
A Review on Large Language Models for Visual Analytics
Navya Sonal Agarwal
Sanjay Kumar Sonbhadra
63
0
0
19 Mar 2025
Unlocking the Capabilities of Vision-Language Models for Generalizable and Explainable Deepfake Detection
Peipeng Yu
Jianwei Fei
Hui Gao
Xuan Feng
Zhihua Xia
Chip-Hong Chang
MLLM
VLM
81
1
0
19 Mar 2025
Optimal Transport Adapter Tuning for Bridging Modality Gaps in Few-Shot Remote Sensing Scene Classification
Zhong Ji
Ci Liu
Jingren Liu
Chen Tang
Yanwei Pang
Xuelong Li
OT
53
0
0
19 Mar 2025
POSTA: A Go-to Framework for Customized Artistic Poster Generation
Haoyu Chen
Xiaojie Xu
Wenbo Li
Jingjing Ren
Tian Ye
Songhua Liu
Ying Chen
Lei Zhu
Xinchao Wang
DiffM
67
1
0
19 Mar 2025
Visual Persona: Foundation Model for Full-Body Human Customization
Jisu Nam
Soowon Son
Zhan Xu
Jing Shi
Difan Liu
Feng Liu
Aashish Misraa
Seungryong Kim
Yang Zhou
DiffM
56
0
0
19 Mar 2025
1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities
Kevin Wang
Ishaan Javali
Michał Bortkiewicz
Tomasz Trzciński
Benjamin Eysenbach
SSL
OffRL
77
1
0
19 Mar 2025
Aligning Information Capacity Between Vision and Language via Dense-to-Sparse Feature Distillation for Image-Text Matching
Yang Liu
Wentao Feng
Zhuoyao Liu
Shudong Huang
Jiancheng Lv
DiffM
VLM
53
0
0
19 Mar 2025
Visual Position Prompt for MLLM based Visual Grounding
Wei Tang
Yanpeng Sun
Qinying Gu
Zechao Li
VLM
50
0
0
19 Mar 2025
EfficientLLaVA: Generalizable Auto-Pruning for Large Vision-language Models
Yinan Liang
Zehua Wang
Xiuwei Xu
Jie Zhou
Jiwen Lu
VLM
LRM
56
0
0
19 Mar 2025
Universal Scene Graph Generation
Shengqiong Wu
Hao Fei
Tat-Seng Chua
45
0
0
19 Mar 2025
Cube: A Roblox View of 3D Intelligence
Foundation AI Team Roblox
Kiran Bhat
Nishchaie Khanna
Karun Channa
Tinghui Zhou
...
Kyle Price
Steve Han
Yiqing Wang
A. Singh
David Baszucki
68
0
0
19 Mar 2025
Improving LLM Video Understanding with 16 Frames Per Second
Yong Li
Changli Tang
Jimin Zhuang
Yudong Yang
Guangzhi Sun
W. Li
Zejun Ma
Chao Zhang
VLM
90
1
0
18 Mar 2025
Tracking Meets Large Multimodal Models for Driving Scenario Understanding
Ayesha Ishaq
Jean Lahoud
Fahad Shahbaz Khan
Salman Khan
Hisham Cholakkal
Rao Muhammad Anwer
59
0
0
18 Mar 2025
LED: LLM Enhanced Open-Vocabulary Object Detection without Human Curated Data Generation
Yang Zhou
Shiyu Zhao
Yuxiao Chen
Zhangyu Wang
Can Jin
Dimitris N. Metaxas
ObjD
64
0
0
18 Mar 2025
ExDDV: A New Dataset for Explainable Deepfake Detection in Video
Vlad Hondru
Eduard Hogea
Darian M. Onchis
Radu Tudor Ionescu
65
1
0
18 Mar 2025
Can Large Vision Language Models Read Maps Like a Human?
Shuo Xing
Zezhou Sun
Shuangyu Xie
Kaiyuan Chen
Yanjia Huang
Yuping Wang
Jiachen Li
Dezhen Song
Zhengzhong Tu
72
3
0
18 Mar 2025
MP-GUI: Modality Perception with MLLMs for GUI Understanding
Ziwei Wang
Weizhi Chen
Leyang Yang
Sheng Zhou
Shengchu Zhao
Hanbei Zhan
Jiongchao Jin
Liangcheng Li
Zirui Shao
Jiajun Bu
81
1
0
18 Mar 2025
ChatBEV: A Visual Language Model that Understands BEV Maps
Qingyao Xu
Tian Jin
Guang Chen
Yanfeng Wang
Yuyao Zhang
51
0
0
18 Mar 2025