ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2304.08485
  4. Cited By
Visual Instruction Tuning

Visual Instruction Tuning

17 April 2023
Haotian Liu
Chunyuan Li
Qingyang Wu
Yong Jae Lee
    SyDa
    VLM
    MLLM
ArXivPDFHTML

Papers citing "Visual Instruction Tuning"

50 / 3,277 papers shown
Title
JarvisIR: Elevating Autonomous Driving Perception with Intelligent Image Restoration
JarvisIR: Elevating Autonomous Driving Perception with Intelligent Image Restoration
Yunlong Lin
Zixu Lin
Haoyu Chen
Panwang Pan
C. Li
Sixiang Chen
Yeying Jin
W. J. Li
Xinghao Ding
30
1
0
05 Apr 2025
TARAC: Mitigating Hallucination in LVLMs via Temporal Attention Real-time Accumulative Connection
TARAC: Mitigating Hallucination in LVLMs via Temporal Attention Real-time Accumulative Connection
C. Xie
Tongxuan Liu
Lei Jiang
Yuting Zeng
J. Guo
Yunheng Shen
Weizhe Huang
Jing Li
Xiaohua Xu
VLM
61
0
0
05 Apr 2025
NuScenes-SpatialQA: A Spatial Understanding and Reasoning Benchmark for Vision-Language Models in Autonomous Driving
NuScenes-SpatialQA: A Spatial Understanding and Reasoning Benchmark for Vision-Language Models in Autonomous Driving
Kexin Tian
Jingrui Mao
Y. Zhang
Jiwan Jiang
Yang Zhou
Zhengzhong Tu
CoGe
73
0
0
04 Apr 2025
SocialGesture: Delving into Multi-person Gesture Understanding
SocialGesture: Delving into Multi-person Gesture Understanding
Xu Cao
Pranav Virupaksha
Wenqi Jia
Bolin Lai
Fiona Ryan
Sangmin Lee
James M. Rehg
SLR
58
0
0
03 Apr 2025
Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models
Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models
Mateusz Pach
Shyamgopal Karthik
Quentin Bouniot
Serge Belongie
Zeynep Akata
VLM
69
0
0
03 Apr 2025
STING-BEE: Towards Vision-Language Model for Real-World X-ray Baggage Security Inspection
STING-BEE: Towards Vision-Language Model for Real-World X-ray Baggage Security Inspection
Divya Velayudhan
A. Ahmed
Mohamad Alansari
Neha Gour
Abderaouf Behouch
...
Muzammal Naseer
Juergen Gall
Mohammed Bennamoun
Ernesto Damiani
Naoufel Werghi
50
0
0
03 Apr 2025
Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing
Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing
Xiangyu Zhao
Peiyuan Zhang
Kexian Tang
Hao Li
Zicheng Zhang
Guangtao Zhai
Junchi Yan
Hua Yang
Xue Yang
Haodong Duan
VLM
LRM
46
1
0
03 Apr 2025
A Survey of Large Language Models in Mental Health Disorder Detection on Social Media
A Survey of Large Language Models in Mental Health Disorder Detection on Social Media
Zhuohan Ge
Nicole Hu
Darian Li
Yubo Wang
Shihao Qi
Yuming Xu
Han Shi
J. Zhang
AI4MH
69
0
0
03 Apr 2025
Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision
Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision
Xiaofeng Han
Shunpeng Chen
Zenghuang Fu
Zhe Feng
Lue Fan
...
Li Guo
Weiliang Meng
Xiaopeng Zhang
Rongtao Xu
Shibiao Xu
74
1
0
03 Apr 2025
VARGPT-v1.1: Improve Visual Autoregressive Large Unified Model via Iterative Instruction Tuning and Reinforcement Learning
VARGPT-v1.1: Improve Visual Autoregressive Large Unified Model via Iterative Instruction Tuning and Reinforcement Learning
Xianwei Zhuang
Yuxin Xie
Yufan Deng
Dongchao Yang
Liming Liang
Jinghan Ru
Yuguo Yin
Yuexian Zou
71
3
0
03 Apr 2025
Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation
Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation
Chuanqi Cheng
Jian Guan
Wei Wu
Rui Yan
VLM
52
0
0
03 Apr 2025
Aligned Better, Listen Better for Audio-Visual Large Language Models
Aligned Better, Listen Better for Audio-Visual Large Language Models
Yuxin Guo
Shuailei Ma
Shijie Ma
Xiaoyi Bao
Chen-Wei Xie
Kecheng Zheng
Tingyu Weng
Siyang Sun
Yun Zheng
Wei Zou
MLLM
AuLLM
67
2
0
02 Apr 2025
PiCo: Jailbreaking Multimodal Large Language Models via $\textbf{Pi}$ctorial $\textbf{Co}$de Contextualization
PiCo: Jailbreaking Multimodal Large Language Models via Pi\textbf{Pi}Pictorial Co\textbf{Co}Code Contextualization
Aofan Liu
Lulu Tang
Ting Pan
Yuguo Yin
Bin Wang
Ao Yang
MLLM
AAML
61
0
0
02 Apr 2025
Towards Unified Referring Expression Segmentation Across Omni-Level Visual Target Granularities
Towards Unified Referring Expression Segmentation Across Omni-Level Visual Target Granularities
Jing Liu
Wenxuan Wang
Yisi Zhang
Yepeng Tang
Xingjian He
Longteng Guo
Tongtian Yue
Xinlong Wang
ObjD
53
0
0
02 Apr 2025
Leveraging Modality Tags for Enhanced Cross-Modal Video Retrieval
Leveraging Modality Tags for Enhanced Cross-Modal Video Retrieval
A. Fragomeni
Dima Damen
Michael Wray
35
0
0
02 Apr 2025
AdPO: Enhancing the Adversarial Robustness of Large Vision-Language Models with Preference Optimization
AdPO: Enhancing the Adversarial Robustness of Large Vision-Language Models with Preference Optimization
Chaohu Liu
Tianyi Gui
Yu Liu
Linli Xu
VLM
AAML
70
1
0
02 Apr 2025
ANNEXE: Unified Analyzing, Answering, and Pixel Grounding for Egocentric Interaction
ANNEXE: Unified Analyzing, Answering, and Pixel Grounding for Egocentric Interaction
Yuejiao Su
Yi Wang
Qiongyang Hu
Chuang Yang
Lap-Pui Chau
49
0
0
02 Apr 2025
Safeguarding Vision-Language Models: Mitigating Vulnerabilities to Gaussian Noise in Perturbation-based Attacks
Safeguarding Vision-Language Models: Mitigating Vulnerabilities to Gaussian Noise in Perturbation-based Attacks
Jiawei Wang
Yushen Zuo
Yuanjun Chai
Ziqiang Liu
Yichen Fu
Yichun Feng
Kin-Man Lam
AAML
VLM
47
0
0
02 Apr 2025
Ross3D: Reconstructive Visual Instruction Tuning with 3D-Awareness
Ross3D: Reconstructive Visual Instruction Tuning with 3D-Awareness
Haochen Wang
Yucheng Zhao
Tiancai Wang
Haoqiang Fan
Xinming Zhang
Zhaoxiang Zhang
75
0
0
02 Apr 2025
UniViTAR: Unified Vision Transformer with Native Resolution
UniViTAR: Unified Vision Transformer with Native Resolution
Limeng Qiao
Yiyang Gan
Bairui Wang
Jie Qin
Shuang Xu
Siqi Yang
Lin Ma
57
0
0
02 Apr 2025
Enhanced Cross-modal 3D Retrieval via Tri-modal Reconstruction
Enhanced Cross-modal 3D Retrieval via Tri-modal Reconstruction
Junlong Ren
Hao Wang
45
0
0
02 Apr 2025
TimeSearch: Hierarchical Video Search with Spotlight and Reflection for Human-like Long Video Understanding
TimeSearch: Hierarchical Video Search with Spotlight and Reflection for Human-like Long Video Understanding
Junwen Pan
Rui Zhang
Xin Wan
Yuan Zhang
Ming Lu
Qi She
VLM
46
1
0
02 Apr 2025
Slow-Fast Architecture for Video Multi-Modal Large Language Models
Slow-Fast Architecture for Video Multi-Modal Large Language Models
Min Shi
Shihao Wang
Chieh-Yun Chen
Jitesh Jain
Kai Wang
Junjun Xiong
Guilin Liu
Zhiding Yu
Humphrey Shi
40
2
0
02 Apr 2025
ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement
ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement
Runhui Huang
Chunwei Wang
Junwei Yang
Guansong Lu
Yunlong Yuan
...
Lu Hou
Wei Zhang
Lanqing Hong
Hengshuang Zhao
Hang Xu
MLLM
92
3
0
02 Apr 2025
Prompting Medical Vision-Language Models to Mitigate Diagnosis Bias by Generating Realistic Dermoscopic Images
Prompting Medical Vision-Language Models to Mitigate Diagnosis Bias by Generating Realistic Dermoscopic Images
Nusrat Munia
Abdullah-Al-Zubaer Imran
LM&MA
MedIm
37
0
0
02 Apr 2025
Reasoning LLMs for User-Aware Multimodal Conversational Agents
Reasoning LLMs for User-Aware Multimodal Conversational Agents
Hamed Rahimi
Jeanne Cattoni
Meriem Beghili
Mouad Abrini
Mahdi Khoramshahi
Maribel Pino
Mohamed Chetouani
LRM
36
0
0
02 Apr 2025
WorldPrompter: Traversable Text-to-Scene Generation
WorldPrompter: Traversable Text-to-Scene Generation
Zhaoyang Zhang
Yannick Hold-Geoffroy
Miloš Hašan
Chen Ziwen
Fujun Luan
Julie Dorsey
Yiwei Hu
VGen
53
0
0
02 Apr 2025
Multimodal Reference Visual Grounding
Multimodal Reference Visual Grounding
Yangxiao Lu
Ruosen Li
Liqiang Jing
Jikai Wang
Xinya Du
Yunhui Guo
Nicholas Ruozzi
Yu Xiang
ObjD
81
0
0
02 Apr 2025
POPEN: Preference-Based Optimization and Ensemble for LVLM-Based Reasoning Segmentation
POPEN: Preference-Based Optimization and Ensemble for LVLM-Based Reasoning Segmentation
Lanyun Zhu
Tianrun Chen
Qianxiong Xu
Xuanyi Liu
Deyi Ji
Haiyang Wu
De Wen Soh
Jun Liu
VLM
LRM
50
0
0
01 Apr 2025
Improved Visual-Spatial Reasoning via R1-Zero-Like Training
Improved Visual-Spatial Reasoning via R1-Zero-Like Training
Zhenyi Liao
Qingsong Xie
Yanhao Zhang
Zijian Kong
Haonan Lu
Zhenyu Yang
Zhijie Deng
ReLM
VLM
LRM
101
0
1
01 Apr 2025
IDMR: Towards Instance-Driven Precise Visual Correspondence in Multimodal Retrieval
IDMR: Towards Instance-Driven Precise Visual Correspondence in Multimodal Retrieval
Bangwei Liu
Yicheng Bao
Shaohui Lin
Xuhong Wang
Xin Tan
Yansen Wang
Yuan Xie
Chaochao Lu
84
0
0
01 Apr 2025
Scaling Prompt Instructed Zero Shot Composed Image Retrieval with Image-Only Data
Scaling Prompt Instructed Zero Shot Composed Image Retrieval with Image-Only Data
Yiqun Duan
Sameera Ramasinghe
Stephen Gould
Ajanthan Thalaiyasingam
48
0
0
01 Apr 2025
Efficient LLaMA-3.2-Vision by Trimming Cross-attended Visual Features
Efficient LLaMA-3.2-Vision by Trimming Cross-attended Visual Features
Jewon Lee
Ki-Ung Song
Seungmin Yang
Donguk Lim
Jaeyeon Kim
Wooksu Shin
Bo-Kyeong Kim
Yong Jae Lee
Tae-Ho Kim
VLM
55
0
0
01 Apr 2025
Recitation over Reasoning: How Cutting-Edge Language Models Can Fail on Elementary School-Level Reasoning Problems?
Recitation over Reasoning: How Cutting-Edge Language Models Can Fail on Elementary School-Level Reasoning Problems?
Kai Yan
Yufei Xu
Zhengyin Du
Xuesong Yao
Zihan Wang
Xiaowen Guo
Jiecao Chen
ReLM
ELM
LRM
95
4
0
01 Apr 2025
Scaling Language-Free Visual Representation Learning
Scaling Language-Free Visual Representation Learning
David Fan
Shengbang Tong
Jiachen Zhu
Koustuv Sinha
Zhuang Liu
...
Michael G. Rabbat
Nicolas Ballas
Yann LeCun
Amir Bar
Saining Xie
CLIP
VLM
75
2
0
01 Apr 2025
4th PVUW MeViS 3rd Place Report: Sa2VA
4th PVUW MeViS 3rd Place Report: Sa2VA
Haobo Yuan
Tao Zhang
Xuelong Li
Lu Qi
Zilong Huang
Shilin Xu
Jiashi Feng
Ming Yang
47
1
0
01 Apr 2025
AI Judges in Design: Statistical Perspectives on Achieving Human Expert Equivalence With Vision-Language Models
AI Judges in Design: Statistical Perspectives on Achieving Human Expert Equivalence With Vision-Language Models
Kristen M. Edwards
Farnaz Tehranchi
Scarlett R. Miller
Faez Ahmed
72
0
0
01 Apr 2025
ShieldGemma 2: Robust and Tractable Image Content Moderation
ShieldGemma 2: Robust and Tractable Image Content Moderation
Wenjun Zeng
D. Kurniawan
Ryan Mullins
Yuchi Liu
Tamoghna Saha
...
Mani Malek
Hamid Palangi
Joon Baek
Rick Pereira
Karthik Narasimhan
AI4MH
36
0
0
01 Apr 2025
Towards Understanding How Knowledge Evolves in Large Vision-Language Models
Towards Understanding How Knowledge Evolves in Large Vision-Language Models
Sudong Wang
Yujie Zhang
Yao Zhu
Jianing Li
Zizhe Wang
Yi Liu
Xiangyang Ji
181
0
0
31 Mar 2025
HOIGen-1M: A Large-scale Dataset for Human-Object Interaction Video Generation
HOIGen-1M: A Large-scale Dataset for Human-Object Interaction Video Generation
Kun Liu
Qi Liu
Xinchen Liu
Jie Li
Yongdong Zhang
Jiebo Luo
Xiaodong He
Wu Liu
VGen
49
0
0
31 Mar 2025
Communication-Efficient and Personalized Federated Foundation Model Fine-Tuning via Tri-Matrix Adaptation
Communication-Efficient and Personalized Federated Foundation Model Fine-Tuning via Tri-Matrix Adaptation
Yongqian Li
Bo Liu
Sheng Huang
Zhe Zhang
Xiaotong Yuan
Richang Hong
51
0
0
31 Mar 2025
DeepDubber-V1: Towards High Quality and Dialogue, Narration, Monologue Adaptive Movie Dubbing Via Multi-Modal Chain-of-Thoughts Reasoning Guidance
DeepDubber-V1: Towards High Quality and Dialogue, Narration, Monologue Adaptive Movie Dubbing Via Multi-Modal Chain-of-Thoughts Reasoning Guidance
Junjie Zheng
Zihao Chen
Chaofan Ding
Xinhan Di
VGen
75
1
0
31 Mar 2025
KOFFVQA: An Objectively Evaluated Free-form VQA Benchmark for Large Vision-Language Models in the Korean Language
KOFFVQA: An Objectively Evaluated Free-form VQA Benchmark for Large Vision-Language Models in the Korean Language
Yoonshik Kim
Jaeyoon Jung
39
0
0
31 Mar 2025
AirCache: Activating Inter-modal Relevancy KV Cache Compression for Efficient Large Vision-Language Model Inference
AirCache: Activating Inter-modal Relevancy KV Cache Compression for Efficient Large Vision-Language Model Inference
Kai Huang
Hao Zou
Bochen Wang
Ye Xi
Zhen Xie
Hao Wang
VLM
49
0
0
31 Mar 2025
Orchestrate Multimodal Data with Batch Post-Balancing to Accelerate Multimodal Large Language Model Training
Orchestrate Multimodal Data with Batch Post-Balancing to Accelerate Multimodal Large Language Model Training
Yijie Zheng
Bangjun Xiao
Lei Shi
Xiaoyang Li
Faming Wu
Tianyu Li
Xuefeng Xiao
Wenjie Qu
Yansen Wang
Shouda Liu
MLLM
MoE
75
1
0
31 Mar 2025
Self-Evolving Visual Concept Library using Vision-Language Critics
Self-Evolving Visual Concept Library using Vision-Language Critics
Atharva Sehgal
Patrick Yuan
Ziniu Hu
Yisong Yue
Jennifer J. Sun
Swarat Chaudhuri
VLM
50
0
0
31 Mar 2025
Enhancing Image Resolution of Solar Magnetograms: A Latent Diffusion Model Approach
Enhancing Image Resolution of Solar Magnetograms: A Latent Diffusion Model Approach
Francesco P. Ramunno
Paolo Massa
Vitaliy Kinakh
Brandon Panos
A. Csillaghy
S. Voloshynovskiy
DiffM
58
0
0
31 Mar 2025
FakeScope: Large Multimodal Expert Model for Transparent AI-Generated Image Forensics
FakeScope: Large Multimodal Expert Model for Transparent AI-Generated Image Forensics
Yixuan Li
Yu Tian
Yipo Huang
Wei Lu
Shiqi Wang
Weisi Lin
Anderson de Rezende Rocha
68
0
0
31 Mar 2025
Consistent Subject Generation via Contrastive Instantiated Concepts
Consistent Subject Generation via Contrastive Instantiated Concepts
Lee Hsin-Ying
Kelvin Chan
Ming Yang
DiffM
95
0
0
31 Mar 2025
XLRS-Bench: Could Your Multimodal LLMs Understand Extremely Large Ultra-High-Resolution Remote Sensing Imagery?
XLRS-Bench: Could Your Multimodal LLMs Understand Extremely Large Ultra-High-Resolution Remote Sensing Imagery?
Fengxiang Wang
H. Wang
Mingshuo Chen
Di Wang
Yulin Wang
...
L. Lan
Wenjing Yang
Jing Zhang
Zhiyuan Liu
Maosong Sun
63
3
0
31 Mar 2025
Previous
123...678...646566
Next