ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2304.08485
  4. Cited By
Visual Instruction Tuning

Visual Instruction Tuning

17 April 2023
Haotian Liu
Chunyuan Li
Qingyang Wu
Yong Jae Lee
    SyDa
    VLM
    MLLM
ArXivPDFHTML

Papers citing "Visual Instruction Tuning"

50 / 3,279 papers shown
Title
Orthus: Autoregressive Interleaved Image-Text Generation with Modality-Specific Heads
Siqi Kou
Jiachun Jin
Chang Liu
Ye Ma
Jian Jia
Quan Chen
Peng Jiang
Zhijie Deng
Zhijie Deng
DiffM
VGen
VLM
137
7
0
28 Nov 2024
FaithDiff: Unleashing Diffusion Priors for Faithful Image
  Super-resolution
FaithDiff: Unleashing Diffusion Priors for Faithful Image Super-resolution
Junyang Chen
Jinshan Pan
Jiangxin Dong
80
0
0
27 Nov 2024
Grid-augmented vision: A simple yet effective approach for enhanced
  spatial understanding in multi-modal agents
Grid-augmented vision: A simple yet effective approach for enhanced spatial understanding in multi-modal agents
Joongwon Chae
Zhenyu Wang
Peiwu Qin
Dongmei Yu
Peiwu Qin
71
0
0
27 Nov 2024
TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding
  with Superior Temporal Localization Ability
TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability
Shimin Chen
Xiaohan Lan
Yitian Yuan
Zequn Jie
Lin Ma
VLM
MLLM
92
13
0
27 Nov 2024
Enhancing Visual Reasoning with Autonomous Imagination in Multimodal
  Large Language Models
Enhancing Visual Reasoning with Autonomous Imagination in Multimodal Large Language Models
Jiaheng Liu
Yumeng Li
Boyuan Xiao
Yichang Jian
Ziang Qin
Tianjia Shao
Yao-Xiang Ding
Kun Zhou
MLLM
LRM
111
4
0
27 Nov 2024
When Large Vision-Language Models Meet Person Re-Identification
When Large Vision-Language Models Meet Person Re-Identification
Qizao Wang
Bin Li
Xiangyang Xue
89
3
0
27 Nov 2024
VLM-HOI: Vision Language Models for Interpretable Human-Object
  Interaction Analysis
VLM-HOI: Vision Language Models for Interpretable Human-Object Interaction Analysis
Donggoo Kang
Dasol Jeong
Hyunmin Lee
Sangwoo Park
Hasil Park
Sunkyu Kwon
Yeongjoon Kim
Joonki Paik
MLLM
VLM
88
0
0
27 Nov 2024
ChatRex: Taming Multimodal LLM for Joint Perception and Understanding
ChatRex: Taming Multimodal LLM for Joint Perception and Understanding
Qing Jiang
Gen Luo
Yuqin Yang
Yuda Xiong
Yihao Chen
Zhaoyang Zeng
Tianhe Ren
Lei Zhang
VLM
LRM
109
7
0
27 Nov 2024
Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning
Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning
Di Zhang
Jingdi Lei
Junxian Li
Xunzhi Wang
Yong Liu
...
Steve Yang
Jianbo Wu
Peng Ye
Wanli Ouyang
Dongzhan Zhou
OffRL
LRM
107
6
0
27 Nov 2024
HyperGLM: HyperGraph for Video Scene Graph Generation and Anticipation
HyperGLM: HyperGraph for Video Scene Graph Generation and Anticipation
Trong-Thuan Nguyen
Pha Nguyen
J. Cothren
Alper Yilmaz
Khoa Luu
101
1
0
27 Nov 2024
Evaluating Vision-Language Models as Evaluators in Path Planning
Evaluating Vision-Language Models as Evaluators in Path Planning
Mohamed Aghzal
Xiang Yue
Erion Plaku
Ziyu Yao
LRM
82
1
0
27 Nov 2024
NEMO: Can Multimodal LLMs Identify Attribute-Modified Objects?
NEMO: Can Multimodal LLMs Identify Attribute-Modified Objects?
Jiaxuan Li
Junwen Mo
MinhDuc Vo
Akihiro Sugimoto
Hideki Nakayama
100
0
0
26 Nov 2024
HyperSeg: Towards Universal Visual Segmentation with Large Language
  Model
HyperSeg: Towards Universal Visual Segmentation with Large Language Model
Cong Wei
Yujie Zhong
Haoxian Tan
Yong Liu
Zheng Zhao
Jie Hu
Yujiu Yang
VOS
MLLM
VLM
LRM
91
1
0
26 Nov 2024
CoA: Chain-of-Action for Generative Semantic Labels
CoA: Chain-of-Action for Generative Semantic Labels
Meng Wei
Zhongnian Li
Peng Ying
Xinzheng Xu
VLM
74
0
0
26 Nov 2024
InsightEdit: Towards Better Instruction Following for Image Editing
InsightEdit: Towards Better Instruction Following for Image Editing
Yingjing Xu
Jie Kong
Jiazhi Wang
Xiao Pan
Bo Lin
Qiang Liu
DiffM
96
1
0
26 Nov 2024
A Topic-level Self-Correctional Approach to Mitigate Hallucinations in
  MLLMs
A Topic-level Self-Correctional Approach to Mitigate Hallucinations in MLLMs
Lehan He
Zeren Chen
Zhelun Shi
Tianyu Yu
Jing Shao
Lu Sheng
MLLM
116
1
0
26 Nov 2024
Efficient Multi-modal Large Language Models via Visual Token Grouping
Efficient Multi-modal Large Language Models via Visual Token Grouping
Minbin Huang
Runhui Huang
Han Shi
Yimeng Chen
Chuanyang Zheng
Xiangguo Sun
Xin Jiang
Zhiyu Li
Hong Cheng
VLM
101
4
0
26 Nov 2024
Exploring Aleatoric Uncertainty in Object Detection via Vision
  Foundation Models
Exploring Aleatoric Uncertainty in Object Detection via Vision Foundation Models
Peng Cui
Guande He
Dan Zhang
Zhijie Deng
Yinpeng Dong
Jun Zhu
92
1
0
26 Nov 2024
Efficient Self-Improvement in Multimodal Large Language Models: A
  Model-Level Judge-Free Approach
Efficient Self-Improvement in Multimodal Large Language Models: A Model-Level Judge-Free Approach
Shijian Deng
Wentian Zhao
Yu-Jhe Li
Kun Wan
Daniel Miranda
Ajinkya Kale
Yapeng Tian
LRM
93
6
0
26 Nov 2024
DiagramQG: Concept-Focused Diagram Question Generation via Hierarchical Knowledge Integration
DiagramQG: Concept-Focused Diagram Question Generation via Hierarchical Knowledge Integration
X. Zhang
L. Zhang
Yanrui Wu
Muye Huang
Wenjun Wu
Bo Li
Shaowei Wang
Jun Liu
Jun Liu
71
0
0
26 Nov 2024
COAP: Memory-Efficient Training with Correlation-Aware Gradient Projection
Jinqi Xiao
S. Sang
Tiancheng Zhi
Jing Liu
Qing Yan
Linjie Luo
Bo Yuan
Bo Yuan
VLM
114
1
0
26 Nov 2024
SAMWISE: Infusing Wisdom in SAM2 for Text-Driven Video Segmentation
SAMWISE: Infusing Wisdom in SAM2 for Text-Driven Video Segmentation
Claudia Cuttano
Gabriele Trivigno
Gabriele Rosi
Carlo Masone
Giuseppe Averta
VOS
114
2
0
26 Nov 2024
Factorized Visual Tokenization and Generation
Factorized Visual Tokenization and Generation
Zechen Bai
Jianxiong Gao
Ziteng Gao
Pichao Wang
Zheng Zhang
Tong He
Mike Zheng Shou
95
3
0
25 Nov 2024
Noise Diffusion for Enhancing Semantic Faithfulness in Text-to-Image
  Synthesis
Noise Diffusion for Enhancing Semantic Faithfulness in Text-to-Image Synthesis
Boming Miao
C. Li
Xiaobei Wang
Andi Zhang
Rui Sun
Zizhe Wang
Yao Zhu
DiffM
86
0
0
25 Nov 2024
Leveraging the Power of MLLMs for Gloss-Free Sign Language Translation
Leveraging the Power of MLLMs for Gloss-Free Sign Language Translation
Jungeun Kim
Hyeongwoo Jeon
Jongseong Bae
Ha Young Kim
SLR
93
0
0
25 Nov 2024
MOSABench: Multi-Object Sentiment Analysis Benchmark for Evaluating
  Multimodal Large Language Models Understanding of Complex Image
MOSABench: Multi-Object Sentiment Analysis Benchmark for Evaluating Multimodal Large Language Models Understanding of Complex Image
Shezheng Song
Chengxiang He
Shasha Li
Shan Zhao
Chengyu Wang
...
Xiaopeng Li
Qian Wan
Jun Ma
Jie Yu
Xiaoguang Mao
VLM
92
1
0
25 Nov 2024
Language Driven Occupancy Prediction
Language Driven Occupancy Prediction
Zhu Yu
Bowen Pang
Lizhe Liu
Runmin Zhang
Qihao Peng
Maochun Luo
Sheng Yang
Mingxia Chen
Si-Yuan Cao
Hui-Liang Shen
106
2
0
25 Nov 2024
ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities
  through Tree-Based Image Exploration
ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration
Haozhan Shen
Kangjia Zhao
Tiancheng Zhao
Ruochen Xu
Zilun Zhang
Mingwei Zhu
Yuxiang Cai
97
4
0
25 Nov 2024
SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE
SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE
Yongwei Chen
Yushi Lan
Shangchen Zhou
Tengfei Wang
Xingang Pan
112
5
0
25 Nov 2024
RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics
RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics
Chan Hee Song
Valts Blukis
Jonathan Tremblay
Stephen Tyree
Yu-Chuan Su
Stan Birchfield
101
9
0
25 Nov 2024
VideoOrion: Tokenizing Object Dynamics in Videos
VideoOrion: Tokenizing Object Dynamics in Videos
Yicheng Feng
Yijiang Li
Wanpeng Zhang
Sipeng Zheng
Zongqing Lu
Sipeng Zheng
Zongqing Lu
109
1
0
25 Nov 2024
All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages
All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages
Ashmal Vayani
Dinura Dissanayake
Hasindri Watawana
Noor Ahsan
Nevasini Sasikumar
...
Monojit Choudhury
Ivan Laptev
Mubarak Shah
Salman Khan
Fahad A Khan
124
9
0
25 Nov 2024
GEMeX: A Large-Scale, Groundable, and Explainable Medical VQA Benchmark for Chest X-ray Diagnosis
GEMeX: A Large-Scale, Groundable, and Explainable Medical VQA Benchmark for Chest X-ray Diagnosis
Bo Liu
K. Zou
Liming Zhan
Zexin Lu
Xiaoyu Dong
Yidi Chen
Chengqiang Xie
Jiannong Cao
Xiao-Ming Wu
Huazhu Fu
134
0
0
25 Nov 2024
Chain of Attack: On the Robustness of Vision-Language Models Against
  Transfer-Based Adversarial Attacks
Chain of Attack: On the Robustness of Vision-Language Models Against Transfer-Based Adversarial Attacks
Peng Xie
Yequan Bie
Jianda Mao
Yangqiu Song
Yang Wang
Hao Chen
Kani Chen
AAML
86
1
0
24 Nov 2024
ROOT: VLM based System for Indoor Scene Understanding and Beyond
ROOT: VLM based System for Indoor Scene Understanding and Beyond
Yonghui Wang
Shi-Yong Chen
Zhenxing Zhou
Siyi Li
Haoran Li
Wengang Zhou
Haoyang Li
VLM
72
3
0
24 Nov 2024
Is 'Right' Right? Enhancing Object Orientation Understanding in Multimodal Large Language Models through Egocentric Instruction Tuning
Is 'Right' Right? Enhancing Object Orientation Understanding in Multimodal Large Language Models through Egocentric Instruction Tuning
Ji Hyeok Jung
Eun Tae Kim
S. Kim
Joo Ho Lee
Bumsoo Kim
Buru Chang
VLM
308
0
0
24 Nov 2024
AnyEdit: Mastering Unified High-Quality Image Editing for Any Idea
AnyEdit: Mastering Unified High-Quality Image Editing for Any Idea
Qifan Yu
Wei Chow
Zhongqi Yue
Kaihang Pan
Yang Wu
Xiaoyang Wan
Juncheng Billy Li
Siliang Tang
Hao Zhang
Yueting Zhuang
DiffM
115
18
0
24 Nov 2024
freePruner: A Training-free Approach for Large Multimodal Model
  Acceleration
freePruner: A Training-free Approach for Large Multimodal Model Acceleration
Bingxin Xu
Yuzhang Shang
Yunhao Ge
Qian Lou
Yan Yan
102
3
0
23 Nov 2024
$\textit{Revelio}$: Interpreting and leveraging semantic information in
  diffusion models
Revelio\textit{Revelio}Revelio: Interpreting and leveraging semantic information in diffusion models
Dahye Kim
Xavier Thomas
Deepti Ghadiyaram
96
4
0
23 Nov 2024
FINECAPTION: Compositional Image Captioning Focusing on Wherever You
  Want at Any Granularity
FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity
Hang Hua
Qing Liu
Lingzhi Zhang
Jing Shi
Zhifei Zhang
Yilin Wang
Jianming Zhang
Jiebo Luo
CoGe
VLM
103
6
0
23 Nov 2024
Large Language Model with Region-guided Referring and Grounding for CT Report Generation
Large Language Model with Region-guided Referring and Grounding for CT Report Generation
Zhaoyu Chen
Yequan Bie
Haibo Jin
Hao Chen
277
0
0
23 Nov 2024
Lifelong Knowledge Editing for Vision Language Models with Low-Rank Mixture-of-Experts
Lifelong Knowledge Editing for Vision Language Models with Low-Rank Mixture-of-Experts
Qizhou Chen
Chengyu Wang
Dakan Wang
Taolin Zhang
Wangyue Li
Xiaofeng He
KELM
88
1
0
23 Nov 2024
Devils in Middle Layers of Large Vision-Language Models: Interpreting, Detecting and Mitigating Object Hallucinations via Attention Lens
Devils in Middle Layers of Large Vision-Language Models: Interpreting, Detecting and Mitigating Object Hallucinations via Attention Lens
Zhangqi Jiang
Junkai Chen
Beier Zhu
Tingjin Luo
Yankun Shen
Xu Yang
108
4
0
23 Nov 2024
ICT: Image-Object Cross-Level Trusted Intervention for Mitigating Object
  Hallucination in Large Vision-Language Models
ICT: Image-Object Cross-Level Trusted Intervention for Mitigating Object Hallucination in Large Vision-Language Models
Junzhe Chen
Tianshu Zhang
Shijie Huang
Yuwei Niu
Linfeng Zhang
Lijie Wen
Xuming Hu
MLLM
VLM
299
2
0
22 Nov 2024
Neuro-Symbolic Evaluation of Text-to-Video Models using Formal Verification
Neuro-Symbolic Evaluation of Text-to-Video Models using Formal Verification
S P Sharan
Minkyu Choi
Sahil Shah
Harsh Goel
Mohammad Omama
Sandeep Chinchali
EGVM
108
3
0
22 Nov 2024
FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual
  Token Compression
FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression
Yuke Zhu
Chi Xie
Shuang Liang
Bo Zheng
Sheng Guo
83
9
0
21 Nov 2024
FuseGPT: Learnable Layers Fusion of Generative Pre-trained Transformers
FuseGPT: Learnable Layers Fusion of Generative Pre-trained Transformers
Zehua Pei
Hui-Ling Zhen
Xianzhi Yu
Sinno Jialin Pan
M. Yuan
Bei Yu
AI4CE
94
0
0
21 Nov 2024
Panther: Illuminate the Sight of Multimodal LLMs with Instruction-Guided
  Visual Prompts
Panther: Illuminate the Sight of Multimodal LLMs with Instruction-Guided Visual Prompts
Honglin Li
Yuting Gao
Chenglu Zhu
Jingdong Chen
M. Yang
Lin Yang
MLLM
100
0
0
21 Nov 2024
IterIS: Iterative Inference-Solving Alignment for LoRA Merging
IterIS: Iterative Inference-Solving Alignment for LoRA Merging
Hongxu Chen
Runshi Li
Bowei Zhu
Zhen Wang
Long Chen
MoMe
98
0
0
21 Nov 2024
Learning to Reason Iteratively and Parallelly for Complex Visual
  Reasoning Scenarios
Learning to Reason Iteratively and Parallelly for Complex Visual Reasoning Scenarios
Shantanu Jaiswal
Debaditya Roy
Basura Fernando
Cheston Tan
ReLM
LRM
84
2
0
20 Nov 2024
Previous
123...202122...646566
Next