ResearchTrend.AI
Visual Instruction Tuning

17 April 2023
Haotian Liu
Chunyuan Li
Qingyang Wu
Yong Jae Lee
    SyDa
    VLM
    MLLM

Papers citing "Visual Instruction Tuning"

50 / 3,253 papers shown
Perception in Reflection
Yana Wei
Liang Zhao
Kangheng Lin
En Yu
Yuang Peng
...
Jianjian Sun
Haoran Wei
Zheng Ge
Xiangyu Zhang
Vishal M. Patel
31
0
0
09 Apr 2025
Measuring Déjà vu Memorization Efficiently
Narine Kokhlikyan
Bargav Jayaraman
Florian Bordes
Chuan Guo
Kamalika Chaudhuri
30
1
0
08 Apr 2025
On the Suitability of Reinforcement Fine-Tuning to Visual Tasks
X. Chen
Wei Li
Chunxu Liu
Chi Xie
Xiaoyan Hu
Chengqian Ma
Feng Zhu
Rui Zhao
ReLM
LRM
56
0
0
08 Apr 2025
PaMi-VDPO: Mitigating Video Hallucinations by Prompt-Aware Multi-Instance Video Preference Learning
Xinpeng Ding
Kaipeng Zhang
Jinahua Han
Lanqing Hong
Hang Xu
Xuelong Li
MLLM
VLM
242
0
0
08 Apr 2025
Earth-Adapter: Bridge the Geospatial Domain Gaps with Mixture of Frequency Adaptation
Xiaoxing Hu
Ziyang Gong
Yansen Wang
Yuru Jia
Gen Luo
Xue Yang
157
0
0
08 Apr 2025
V-MAGE: A Game Evaluation Framework for Assessing Vision-Centric Capabilities in Multimodal Large Language Models
Xiangxi Zheng
Linjie Li
Zhiyong Yang
Ping Yu
Alex Jinpeng Wang
Rui Yan
Yuan Yao
Lijuan Wang
LRM
26
0
0
08 Apr 2025
MDK12-Bench: A Multi-Discipline Benchmark for Evaluating Reasoning in Multimodal Large Language Models
Pengfei Zhou
Fanrui Zhang
Xiaopeng Peng
Zhaopan Xu
Jiaxin Ai
...
Kai Wang
Xiaojun Chang
Wenqi Shao
Yang You
Kaipeng Zhang
ELM
LRM
37
0
0
08 Apr 2025
SVLTA: Benchmarking Vision-Language Temporal Alignment via Synthetic Video Situation
Hao Du
Bo Wu
Yan Lu
Zhendong Mao
29
0
0
08 Apr 2025
Transfer between Modalities with MetaQueries
Xichen Pan
Satya Narayan Shukla
Aashu Singh
Zhuokai Zhao
Shlok Kumar Mishra
...
Jiuhai Chen
Kunpeng Li
F. Xu
Ji Hou
Saining Xie
DiffM
49
7
0
08 Apr 2025
OmniSVG: A Unified Scalable Vector Graphics Generation Model
Yiying Yang
Wei Cheng
Sijin Chen
Xianfang Zeng
Jiaxu Zhang
Liao Wang
Gang Yu
Xingjun Ma
Yu Jiang
VLM
45
0
0
08 Apr 2025
Taxonomy-Aware Evaluation of Vision-Language Models
Vésteinn Snæbjarnarson
Kevin Du
Niklas Stoehr
Serge Belongie
Ryan Cotterell
Nico Lang
Stella Frank
37
0
0
07 Apr 2025
REEF: Relevance-Aware and Efficient LLM Adapter for Video Understanding
Sakib Reza
Xiyun Song
Heather Yu
Zongfang Lin
Mohsen Moghaddam
Mario Sznaier
34
0
0
07 Apr 2025
Ternarization of Vision Language Models for use on edge devices
Ben Crulis
Cyril de Runz
Barthélémy Serres
Gilles Venturini
VLM
55
0
0
07 Apr 2025
OrderChain: A General Prompting Paradigm to Improve Ordinal Understanding Ability of MLLM
Jinhong Wang
Shuo Tong
Jian Liu
Dongqi Tang
Weiqiang Wang
Wentong Li
Hongxia Xu
Danny Chen
Jintai Chen
Jian Wu
LRM
26
0
0
07 Apr 2025
SmolVLM: Redefining small and efficient multimodal models
Andres Marafioti
Orr Zohar
Miquel Farré
Merve Noyan
Elie Bakouch
...
Hugo Larcher
Mathieu Morlon
Lewis Tunstall
Leandro von Werra
Thomas Wolf
VLM
44
8
0
07 Apr 2025
Caption Anything in Video: Fine-grained Object-centric Captioning via Spatiotemporal Multimodal Prompting
Yunlong Tang
Jing Bi
Chao Huang
Susan Liang
Daiki Shimada
...
Jinxi He
Liu He
Zeliang Zhang
Jiebo Luo
Chenliang Xu
42
0
0
07 Apr 2025
The 1st Solution for 4th PVUW MeViS Challenge: Unleashing the Potential of Large Multimodal Models for Referring Video Segmentation
Hao Fang
Runmin Cong
Xiankai Lu
Z. Chen
Wei Zhang
29
0
0
07 Apr 2025
Enhancing Compositional Reasoning in Vision-Language Models with Synthetic Preference Data
Samarth Mishra
Kate Saenko
Venkatesh Saligrama
CoGe
LRM
39
0
0
07 Apr 2025
SCAM: A Real-World Typographic Robustness Evaluation for Multimodal Foundation Models
Justus Westerhoff
Erblina Purellku
Jakob Hackstein
Jonas Loos
Leo Pinetzki
Lorenz Hufe
AAML
28
0
0
07 Apr 2025
InteractVLM: 3D Interaction Reasoning from 2D Foundational Models
Sai Kumar Dwivedi
Dimitrije Antić
Shashank Tripathi
Omid Taheri
Cordelia Schmid
M. Black
Dimitrios Tzionas
40
1
0
07 Apr 2025
URECA: Unique Region Caption Anything
Sangbeom Lim
J. Kim
Heeji Yoon
Jaewoo Jung
Seungryong Kim
33
0
0
07 Apr 2025
Grounding 3D Object Affordance with Language Instructions, Visual Observations and Interactions
He Zhu
Quyu Kong
Kechun Xu
Xunlong Xia
Bing Deng
Jieping Ye
R. Xiong
Yansen Wang
37
0
0
07 Apr 2025
UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding
Yang Jiao
Haibo Qiu
Zequn Jie
Tian Jin
Jingjing Chen
Lin Ma
Yu Jiang
34
2
0
06 Apr 2025
Domain Generalization for Face Anti-spoofing via Content-aware Composite Prompt Engineering
Jiaxin Guo
Ajian Liu
Yunfeng Diao
Jingyang Zhang
Hui Ma
Bo Zhao
Richang Hong
Meng Wang
21
0
0
06 Apr 2025
M2IV: Towards Efficient and Fine-grained Multimodal In-Context Learning in Large Vision-Language Models
Yanshu Li
Hongyang He
Yi Cao
Qisen Cheng
Xiang Fu
Ruixiang Tang
VLM
45
0
0
06 Apr 2025
The Point, the Vision and the Text: Does Point Cloud Boost Spatial Reasoning of Large Language Models?
Weichen Zhang
Ruiying Peng
Chen Gao
Jianjie Fang
Xin Zeng
...
Zihan Wang
Jinqiang Cui
Xin Wang
Xinlei Chen
Yong Li
LRM
81
0
0
06 Apr 2025
MedM-VL: What Makes a Good Medical LVLM?
Yiming Shi
Shaoshuai Yang
Xun Zhu
Haoyu Wang
Miao Li
Ji Wu
VLM
40
1
0
06 Apr 2025
TARAC: Mitigating Hallucination in LVLMs via Temporal Attention Real-time Accumulative Connection
C. Xie
Tongxuan Liu
Lei Jiang
Yuting Zeng
J. Guo
Yunheng Shen
Weizhe Huang
Jing Li
Xiaohua Xu
VLM
61
0
0
05 Apr 2025
Window Token Concatenation for Efficient Visual Large Language Models
Yifan Li
Wentao Bao
Botao Ye
Zhen Tan
Tianlong Chen
Huan Liu
Yu Kong
VLM
44
0
0
05 Apr 2025
JarvisIR: Elevating Autonomous Driving Perception with Intelligent Image Restoration
Yunlong Lin
Zixu Lin
Haoyu Chen
Panwang Pan
C. Li
Sixiang Chen
Yeying Jin
W. J. Li
Xinghao Ding
28
1
0
05 Apr 2025
NuScenes-SpatialQA: A Spatial Understanding and Reasoning Benchmark for Vision-Language Models in Autonomous Driving
Kexin Tian
Jingrui Mao
Y. Zhang
Jiwan Jiang
Yang Zhou
Zhengzhong Tu
CoGe
73
0
0
04 Apr 2025
SocialGesture: Delving into Multi-person Gesture Understanding
Xu Cao
Pranav Virupaksha
Wenqi Jia
Bolin Lai
Fiona Ryan
Sangmin Lee
James M. Rehg
SLR
58
0
0
03 Apr 2025
Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing
Xiangyu Zhao
Peiyuan Zhang
Kexian Tang
Hao Li
Zicheng Zhang
Guangtao Zhai
Junchi Yan
Hua Yang
Xue Yang
Haodong Duan
VLM
LRM
46
1
0
03 Apr 2025
VARGPT-v1.1: Improve Visual Autoregressive Large Unified Model via Iterative Instruction Tuning and Reinforcement Learning
Xianwei Zhuang
Yuxin Xie
Yufan Deng
Dongchao Yang
Liming Liang
Jinghan Ru
Yuguo Yin
Yuexian Zou
71
3
0
03 Apr 2025
STING-BEE: Towards Vision-Language Model for Real-World X-ray Baggage Security Inspection
Divya Velayudhan
A. Ahmed
Mohamad Alansari
Neha Gour
Abderaouf Behouch
...
Muzammal Naseer
Juergen Gall
Mohammed Bennamoun
Ernesto Damiani
Naoufel Werghi
50
0
0
03 Apr 2025
A Survey of Large Language Models in Mental Health Disorder Detection on Social Media
Zhuohan Ge
Nicole Hu
Darian Li
Yubo Wang
Shihao Qi
Yuming Xu
Han Shi
J. Zhang
AI4MH
63
0
0
03 Apr 2025
Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision
Xiaofeng Han
Shunpeng Chen
Zenghuang Fu
Zhe Feng
Lue Fan
...
Li Guo
Weiliang Meng
Xiaopeng Zhang
Rongtao Xu
Shibiao Xu
74
1
0
03 Apr 2025
Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models
Mateusz Pach
Shyamgopal Karthik
Quentin Bouniot
Serge Belongie
Zeynep Akata
VLM
69
0
0
03 Apr 2025
Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation
Chuanqi Cheng
Jian Guan
Wei Wu
Rui Yan
VLM
52
0
0
03 Apr 2025
PiCo: Jailbreaking Multimodal Large Language Models via Pictorial Code Contextualization
Aofan Liu
Lulu Tang
Ting Pan
Yuguo Yin
Bin Wang
Ao Yang
MLLM
AAML
61
0
0
02 Apr 2025
Reasoning LLMs for User-Aware Multimodal Conversational Agents
Hamed Rahimi
Jeanne Cattoni
Meriem Beghili
Mouad Abrini
Mahdi Khoramshahi
Maribel Pino
Mohamed Chetouani
LRM
36
0
0
02 Apr 2025
Safeguarding Vision-Language Models: Mitigating Vulnerabilities to Gaussian Noise in Perturbation-based Attacks
Jiawei Wang
Yushen Zuo
Yuanjun Chai
Ziqiang Liu
Yichen Fu
Yichun Feng
Kin-Man Lam
AAML
VLM
47
0
0
02 Apr 2025
Enhanced Cross-modal 3D Retrieval via Tri-modal Reconstruction
Junlong Ren
Hao Wang
45
0
0
02 Apr 2025
ANNEXE: Unified Analyzing, Answering, and Pixel Grounding for Egocentric Interaction
Yuejiao Su
Yi Wang
Qiongyang Hu
Chuang Yang
Lap-Pui Chau
47
0
0
02 Apr 2025
Slow-Fast Architecture for Video Multi-Modal Large Language Models
Min Shi
Shihao Wang
Chieh-Yun Chen
Jitesh Jain
Kai Wang
Junjun Xiong
Guilin Liu
Zhiding Yu
Humphrey Shi
40
2
0
02 Apr 2025
WorldPrompter: Traversable Text-to-Scene Generation
Zhaoyang Zhang
Yannick Hold-Geoffroy
Miloš Hašan
Chen Ziwen
Fujun Luan
Julie Dorsey
Yiwei Hu
VGen
53
0
0
02 Apr 2025
Ross3D: Reconstructive Visual Instruction Tuning with 3D-Awareness
Haochen Wang
Yucheng Zhao
Tiancai Wang
Haoqiang Fan
Xinming Zhang
Zhaoxiang Zhang
75
0
0
02 Apr 2025
Towards Unified Referring Expression Segmentation Across Omni-Level Visual Target Granularities
Jing Liu
Wenxuan Wang
Yisi Zhang
Yepeng Tang
Xingjian He
Longteng Guo
Tongtian Yue
Xinlong Wang
ObjD
53
0
0
02 Apr 2025
UniViTAR: Unified Vision Transformer with Native Resolution
Limeng Qiao
Yiyang Gan
Bairui Wang
Jie Qin
Shuang Xu
Siqi Yang
Lin Ma
57
0
0
02 Apr 2025
Multimodal Reference Visual Grounding
Yangxiao Lu
Ruosen Li
Liqiang Jing
Jikai Wang
Xinya Du
Yunhui Guo
Nicholas Ruozzi
Yu Xiang
ObjD
81
0
0
02 Apr 2025