Visual Instruction Tuning

17 April 2023
Haotian Liu
Chunyuan Li
Qingyang Wu
Yong Jae Lee
    SyDa
    VLM
    MLLM

Papers citing "Visual Instruction Tuning"

50 / 3,279 papers shown
The BrowserGym Ecosystem for Web Agent Research
Thibault Le Sellier De Chezelles
Maxime Gasse
Alexandre Lacoste
Alexandre Drouin
Massimo Caccia
...
Siva Reddy
Quentin Cappart
Graham Neubig
Ruslan Salakhutdinov
Nicolas Chapados
LLMAG
120
11
0
06 Dec 2024
ARTeFACT: Benchmarking Segmentation Models on Diverse Analogue Media Damage
D. Ivanova
Marco Aversa
Paul Henderson
John Williamson
99
0
0
05 Dec 2024
Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion
Jiuhai Chen
Jianwei Yang
Haiping Wu
Dianqi Li
Jianfeng Gao
Tianyi Zhou
Bin Xiao
VLM
69
5
0
05 Dec 2024
Bench-CoE: a Framework for Collaboration of Experts from Benchmark
Yuanshuai Wang
Xingjian Zhang
Jinkun Zhao
Siwei Wen
Peilin Feng
Shuhao Liao
Lei Huang
Wenjun Wu
MoE
ALM
96
2
0
05 Dec 2024
Safeguarding Text-to-Image Generation via Inference-Time Prompt-Noise Optimization
Jiangweizhi Peng
Zhiwei Tang
Gaowen Liu
Charles Fleming
Mingyi Hong
87
2
0
05 Dec 2024
EgoPlan-Bench2: A Benchmark for Multimodal Large Language Model Planning in Real-World Scenarios
Lu Qiu
Yuying Ge
Yi Chen
Yixiao Ge
Ying Shan
Xihui Liu
LLMAG
LRM
106
5
0
05 Dec 2024
From Individual to Society: A Survey on Social Simulation Driven by Large Language Model-based Agents
Xinyi Mou
Xuanwen Ding
Qi He
Liang Wang
Jingcong Liang
...
Lin Sun
Jiayu Lin
Jie Zhou
Xuanjing Huang
Zhongyu Wei
LLMAG
LM&Ro
AI4CE
104
14
0
04 Dec 2024
PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation
Ao Wang
Hui Chen
Jianchao Tan
Kai Zhang
Xunliang Cai
Zijia Lin
Jiawei Han
Guiguang Ding
VLM
90
3
0
04 Dec 2024
TASR: Timestep-Aware Diffusion Model for Image Super-Resolution
Qinwei Lin
Xiaopeng Sun
Yu Gao
Yujie Zhong
Dengjie Li
Zheng Zhao
Haoqian Wang
86
0
0
04 Dec 2024
Who Brings the Frisbee: Probing Hidden Hallucination Factors in Large Vision-Language Model via Causality Analysis
Po-Hsuan Huang
Jeng-Lin Li
Chin-Po Chen
Ming-Ching Chang
Wei-Chao Chen
LRM
82
1
0
04 Dec 2024
DynamicControl: Adaptive Condition Selection for Improved Text-to-Image Generation
Qu He
Jinlong Peng
P. Xu
Boyuan Jiang
Xiaobin Hu
...
Yang Liu
Yun Wang
Chengjie Wang
Xuelong Li
Jingyang Zhang
DiffM
125
1
0
04 Dec 2024
AdvDreamer Unveils: Are Vision-Language Models Truly Ready for Real-World 3D Variations?
Shouwei Ruan
Hanqin Liu
Yao Huang
Xiaoqi Wang
Caixin Kang
Hang Su
Yinpeng Dong
Xingxing Wei
VGen
105
0
0
04 Dec 2024
Enhancing Trust in Large Language Models with Uncertainty-Aware Fine-Tuning
R. Krishnan
Piyush Khanna
Omesh Tickoo
HILM
72
1
0
03 Dec 2024
Cosmos-LLaVA: Chatting with the Visual Cosmos-LLaVA: Görselle Sohbet Etmek
Ahmed Zeer
Eren Dogan
Yusuf Erdem
Elif Ince
Osama Shbib
M. E. Uzun
Atahan Uz
M. K. Yuce
Himmet Toprak Kesgin
M. Fatih Amasyali
VLM
84
0
0
03 Dec 2024
AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?
Kaixiong Gong
Kaituo Feng
Yangqiu Song
Yibing Wang
Mofan Cheng
...
Jiaming Han
Benyou Wang
Yutong Bai
Zheng Yang
Xiangyu Yue
MLLM
AuLLM
VLM
91
6
0
03 Dec 2024
SJTU:Spatial judgments in multimodal models towards unified segmentation through coordinate detection
Joongwon Chae
Zhenyu Wang
Peiwu Qin
VLM
87
0
0
03 Dec 2024
Composing Open-domain Vision with RAG for Ocean Monitoring and Conservation
Sepand Dyanatkar
Angran Li
Alexander Dungate
64
0
0
03 Dec 2024
AccDiffusion v2: Towards More Accurate Higher-Resolution Diffusion Extrapolation
Zhihang Lin
Mingbao Lin
Wengyi Zhan
Rongrong Ji
80
0
0
03 Dec 2024
Progress-Aware Video Frame Captioning
Zihui Xue
Joungbin An
Xitong Yang
Kristen Grauman
114
1
0
03 Dec 2024
VISCO: Benchmarking Fine-Grained Critique and Correction Towards Self-Improvement in Visual Reasoning
Xueqing Wu
Yuheng Ding
Bingxuan Li
Pan Lu
Da Yin
Kai-Wei Chang
Nanyun Peng
LRM
108
3
0
03 Dec 2024
VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models
Byung-Kwan Lee
Ryo Hachiuma
Yu-Chiang Frank Wang
Y. Ro
Yueh-Hua Wu
VLM
88
0
0
02 Dec 2024
PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos
Meng Cao
Haoran Tang
Haoze Zhao
Hangyu Guo
Jing Liu
Ge Zhang
Ruyang Liu
Qiang Sun
Ian Reid
Xiaodan Liang
115
2
0
02 Dec 2024
VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval
Dhiman Paul
Md Rizwan Parvez
Nabeel Mohammed
Shafin Rahman
VGen
85
0
0
02 Dec 2024
The Bare Necessities: Designing Simple, Effective Open-Vocabulary Scene Graphs
Christina Kassab
Matías Mattamala
Sacha Morin
Martin Buchner
Abhinav Valada
Liam Paull
Maurice F. Fallon
93
4
0
02 Dec 2024
Enhancing Perception Capabilities of Multimodal LLMs with Training-Free Fusion
Zhuokun Chen
Jinwu Hu
Zeshuai Deng
Yufeng Wang
Bohan Zhuang
Mingkui Tan
76
0
0
02 Dec 2024
Align-KD: Distilling Cross-Modal Alignment Knowledge for Mobile Vision-Language Model
Qianhan Feng
Wenshuo Li
Tong Lin
Xinghao Chen
VLM
82
0
0
02 Dec 2024
Quantization-Aware Imitation-Learning for Resource-Efficient Robotic Control
Seongmin Park
Hyungmin Kim
Wonseok Jeon
Juyoung Yang
Byeongwook Jeon
Yoonseon Oh
Jungwook Choi
101
1
0
02 Dec 2024
SeqAfford: Sequential 3D Affordance Reasoning via Multimodal Large Language Model
Chunlin Yu
Hanqing Wang
Ye Shi
Haoyang Luo
Sibei Yang
Jingyi Yu
Jingya Wang
LRM
LM&Ro
102
1
0
02 Dec 2024
Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs
Qizhe Zhang
Aosong Cheng
Ming Lu
Zhiyong Zhuo
Minqi Wang
Jiajun Cao
Shaobo Guo
Qi She
Shanghang Zhang
VLM
105
11
0
02 Dec 2024
Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis
Anton Voronov
Denis Kuznedelev
Mikhail Khoroshikh
Valentin Khrulkov
Dmitry Baranchuk
122
2
0
02 Dec 2024
OBI-Bench: Can LMMs Aid in Study of Ancient Script on Oracle Bones?
Z. Chen
Tingzhu Chen
Wenjun Zhang
Guangtao Zhai
101
3
0
02 Dec 2024
COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training
Sanghwan Kim
Rui Xiao
Mariana-Iuliana Georgescu
Stephan Alaniz
Zeynep Akata
VLM
101
2
0
02 Dec 2024
LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences
Hongyan Zhi
Peihao Chen
Junyan Li
Shuailei Ma
Xinyu Sun
Tianhang Xiang
Yinjie Lei
Mingkui Tan
Chuang Gan
95
3
0
02 Dec 2024
Beyond Pixels: Text Enhances Generalization in Real-World Image Restoration
Haoze Sun
Wenbo Li
Jiaheng Liu
Kaiwen Zhou
Yongqiang Chen
Yong Guo
Yunshui Li
Renjing Pei
Long Peng
Yue Yang
DiffM
83
1
0
01 Dec 2024
AlignMamba: Enhancing Multimodal Mamba with Local and Global Cross-modal Alignment
Yan Li
Yifei Xing
X. Lan
Xuzhao Li
Haifeng Chen
D. Jiang
Mamba
84
1
0
01 Dec 2024
Paint Outside the Box: Synthesizing and Selecting Training Data for Visual Grounding
Zilin Du
Haoxin Li
Jianfei Yu
Boyang Li
260
0
0
01 Dec 2024
VideoSAVi: Self-Aligned Video Language Models without Human Supervision
Yogesh Kulkarni
Pooyan Fazli
VLM
117
2
0
01 Dec 2024
Instant3dit: Multiview Inpainting for Fast Editing of 3D Objects
Amir Barda
Matheus Gadelha
Vladimir G. Kim
Noam Aigerman
Amit H. Bermano
Thibault Groueix
DiffM
3DGS
82
2
0
30 Nov 2024
ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models
Xubing Ye
Yukang Gan
Yixiao Ge
Xiao Zhang
Yansong Tang
103
7
0
30 Nov 2024
SOLAMI: Social Vision-Language-Action Modeling for Immersive Interaction with 3D Autonomous Characters
Jianping Jiang
Weiye Xiao
Zhengyu Lin
Han Zhang
Tianxiang Ren
Yang Gao
Zhiqian Lin
Zhongang Cai
Lei Yang
Ziwei Liu
90
3
0
29 Nov 2024
Dual Risk Minimization: Towards Next-Level Robustness in Fine-tuning Zero-Shot Models
Kaican Li
Weiyan Xie
Yongxiang Huang
Didan Deng
Lanqing Hong
Zechao Li
Ricardo Silva
N. Zhang
79
0
0
29 Nov 2024
CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation
Qixiu Li
Yaobo Liang
Zeyu Wang
Lin Luo
Xi Chen
...
Jianmin Bao
Dong Chen
Yuanchun Shi
Jiaolong Yang
B. Guo
LM&Ro
89
25
0
29 Nov 2024
ForgerySleuth: Empowering Multimodal Large Language Models for Image Manipulation Detection
Zhihao Sun
Haoran Jiang
Haoran Chen
Yixin Cao
Xipeng Qiu
Zuxuan Wu
Yu-Gang Jiang
78
2
0
29 Nov 2024
On Domain-Specific Post-Training for Multimodal Large Language Models
Daixuan Cheng
Shaohan Huang
Ziyu Zhu
Xintong Zhang
Wayne Xin Zhao
Zhongzhi Luan
Bo Dai
Zhenliang Zhang
VLM
102
2
0
29 Nov 2024
Sparse Attention Vectors: Generative Multimodal Model Features Are Discriminative Vision-Language Classifiers
Chancharik Mitra
Brandon Huang
Tianning Chai
Zhiqiu Lin
Assaf Arbelle
Rogerio Feris
Leonid Karlinsky
Trevor Darrell
Deva Ramanan
Roei Herzig
VLM
140
4
0
28 Nov 2024
Automatic Prompt Generation and Grounding Object Detection for Zero-Shot Image Anomaly Detection
Tsun-hin Cheung
Ka-Chun Fung
Songjiang Lai
Kwan-Ho Lin
Vincent To-Yee NG
K. Lam
77
0
0
28 Nov 2024
Open-Sora Plan: Open-Source Large Video Generation Model
Bin Lin
Yunyang Ge
Xinhua Cheng
Zongjian Li
Bin Zhu
...
Zhang Pan
Xing Zhou
Shaoling Dong
Yonghong Tian
Li-xin Yuan
VLM
VGen
126
60
0
28 Nov 2024
I Dream My Painting: Connecting MLLMs and Diffusion Models via Prompt Generation for Text-Guided Multi-Mask Inpainting
Nicola Fanelli
G. Vessio
Giovanna Castellano
MLLM
DiffM
95
1
0
28 Nov 2024
SPAgent: Adaptive Task Decomposition and Model Selection for General Video Generation and Editing
Rong-Cheng Tu
Wenhao Sun
Zhao Jin
Jingyi Liao
Jiaxing Huang
Dacheng Tao
VGen
DiffM
117
3
0
28 Nov 2024
Puzzle: Distillation-Based NAS for Inference-Optimized LLMs
Akhiad Bercovich
Tomer Ronen
Talor Abramovich
Nir Ailon
Nave Assaf
...
Ido Shahaf
Oren Tropp
Omer Ullman Argov
Ran Zilberstein
Ran El-Yaniv
94
1
0
28 Nov 2024