ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2304.08485
  4. Cited By
Visual Instruction Tuning

Visual Instruction Tuning

17 April 2023
Haotian Liu
Chunyuan Li
Qingyang Wu
Yong Jae Lee
    SyDa
    VLM
    MLLM
ArXivPDFHTML

Papers citing "Visual Instruction Tuning"

50 / 3,333 papers shown
Title
TurtleBench: A Visual Programming Benchmark in Turtle Geometry
TurtleBench: A Visual Programming Benchmark in Turtle Geometry
Sina Rismanchian
Yasaman Razeghi
Sameer Singh
Shayan Doroudi
58
1
0
31 Oct 2024
Driving by the Rules: A Benchmark for Integrating Traffic Sign Regulations into Vectorized HD Map
Driving by the Rules: A Benchmark for Integrating Traffic Sign Regulations into Vectorized HD Map
Xinyuan Chang
Maixuan Xue
Xinran Liu
Zheng Pan
Xing Wei
81
2
0
31 Oct 2024
MoLE: Enhancing Human-centric Text-to-image Diffusion via Mixture of
  Low-rank Experts
MoLE: Enhancing Human-centric Text-to-image Diffusion via Mixture of Low-rank Experts
Jie Zhu
Yukang Chen
Mingyu Ding
Ping Luo
Leye Wang
Jingdong Wang
DiffM
52
4
0
30 Oct 2024
EMMA: End-to-End Multimodal Model for Autonomous Driving
EMMA: End-to-End Multimodal Model for Autonomous Driving
Jyh-Jing Hwang
Runsheng Xu
Hubert Lin
Wei-Chih Hung
Jingwei Ji
...
Benjamin Sapp
Yin Zhou
James Guo
Dragomir Anguelov
Mingxing Tan
VLM
LM&Ro
63
32
0
30 Oct 2024
Unified Triplet-Level Hallucination Evaluation for Large Vision-Language
  Models
Unified Triplet-Level Hallucination Evaluation for Large Vision-Language Models
J. Wu
Tsz Ting Chung
Kai Chen
Dit-Yan Yeung
VLM
LRM
83
3
0
30 Oct 2024
Effective and Efficient Adversarial Detection for Vision-Language Models
  via A Single Vector
Effective and Efficient Adversarial Detection for Vision-Language Models via A Single Vector
Youcheng Huang
Fengbin Zhu
Jingkun Tang
Pan Zhou
Wenqiang Lei
Jiancheng Lv
Tat-Seng Chua
AAML
44
4
0
30 Oct 2024
Diffusion Beats Autoregressive: An Evaluation of Compositional Generation in Text-to-Image Models
Diffusion Beats Autoregressive: An Evaluation of Compositional Generation in Text-to-Image Models
Arash Marioriyad
Parham Rezaei
M. Baghshah
M. Rohban
CoGe
307
0
0
30 Oct 2024
TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters
TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters
Haiyang Wang
Yue Fan
Muhammad Ferjad Naeem
Yongqin Xian
J. E. Lenssen
Liwei Wang
F. Tombari
Bernt Schiele
54
2
0
30 Oct 2024
Multimodal Structure Preservation Learning
Multimodal Structure Preservation Learning
Chang Liu
Jieshi Chen
Lee H. Harrison
A. Dubrawski
41
0
0
29 Oct 2024
VL-Cache: Sparsity and Modality-Aware KV Cache Compression for
  Vision-Language Model Inference Acceleration
VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration
Dezhan Tu
Danylo Vashchilenko
Yuzhe Lu
Panpan Xu
VLM
61
9
0
29 Oct 2024
Natural Language Inference Improves Compositionality in Vision-Language
  Models
Natural Language Inference Improves Compositionality in Vision-Language Models
Paola Cascante-Bonilla
Yu Hou
Yang Trista Cao
Hal Daumé III
Rachel Rudinger
ReLM
CoGe
VLM
65
3
0
29 Oct 2024
Senna: Bridging Large Vision-Language Models and End-to-End Autonomous
  Driving
Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving
Bo Jiang
Shaoyu Chen
Bencheng Liao
Xingyu Zhang
Wei Yin
Qian Zhang
Chang Huang
Wen Liu
Xinyu Wang
VLM
MLLM
LRM
59
22
0
29 Oct 2024
Towards Unifying Understanding and Generation in the Era of Vision
  Foundation Models: A Survey from the Autoregression Perspective
Towards Unifying Understanding and Generation in the Era of Vision Foundation Models: A Survey from the Autoregression Perspective
Shenghao Xie
Wenqiang Zu
Mingyang Zhao
Duo Su
Shilong Liu
Ruohua Shi
Guoqi Li
Shanghang Zhang
Lei Ma
LRM
65
3
0
29 Oct 2024
MotionGPT-2: A General-Purpose Motion-Language Model for Motion
  Generation and Understanding
MotionGPT-2: A General-Purpose Motion-Language Model for Motion Generation and Understanding
Yuan Wang
Di Huang
Yaqi Zhang
Wanli Ouyang
J. Jiao
Xuetao Feng
Yan Zhou
Pengfei Wan
Shixiang Tang
Dan Xu
VGen
57
13
0
29 Oct 2024
A Hierarchical Language Model For Interpretable Graph Reasoning
A Hierarchical Language Model For Interpretable Graph Reasoning
Sambhav Khurana
Xiner Li
Shurui Gui
Shuiwang Ji
LRM
52
0
0
29 Oct 2024
AutoGLM: Autonomous Foundation Agents for GUIs
AutoGLM: Autonomous Foundation Agents for GUIs
Xiao Liu
Bo Qin
Dongzhu Liang
Guang Dong
Hanyu Lai
...
Yujia Wang
Yongjun Xu
Zehan Qi
Yuxiao Dong
Jie Tang
LLMAG
69
13
0
28 Oct 2024
Vision Search Assistant: Empower Vision-Language Models as Multimodal
  Search Engines
Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines
Zhixin Zhang
Yiyuan Zhang
Xiaohan Ding
Xiangyu Yue
48
4
0
28 Oct 2024
Diff-Instruct*: Towards Human-Preferred One-step Text-to-image
  Generative Models
Diff-Instruct*: Towards Human-Preferred One-step Text-to-image Generative Models
Weijian Luo
C. Zhang
Debing Zhang
Zhengyang Geng
35
4
0
28 Oct 2024
Face-MLLM: A Large Face Perception Model
Face-MLLM: A Large Face Perception Model
Haomiao Sun
Mingjie He
Tianheng Lian
Hu Han
Shiguang Shan
VLM
CVBM
LRM
38
5
0
28 Oct 2024
Guide-LLM: An Embodied LLM Agent and Text-Based Topological Map for Robotic Guidance of People with Visual Impairments
Guide-LLM: An Embodied LLM Agent and Text-Based Topological Map for Robotic Guidance of People with Visual Impairments
Sangmim Song
S. Kodagoda
A. Gunatilake
Marc G. Carmichael
Karthick Thiyagarajan
Jodi Martin
LM&Ro
58
1
0
28 Oct 2024
AutoBench-V: Can Large Vision-Language Models Benchmark Themselves?
AutoBench-V: Can Large Vision-Language Models Benchmark Themselves?
Han Bao
Yue Huang
Yanbo Wang
Jiayi Ye
Xiangqi Wang
Preslav Nakov
Mohamed Elhoseiny
Xuzhi Zhang
Mohamed Elhoseiny
Xiangliang Zhang
60
8
0
28 Oct 2024
BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks
BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks
Yunhan Zhao
Xiang Zheng
Lin Luo
Yige Li
Xingjun Ma
Yu-Gang Jiang
VLM
AAML
62
4
0
28 Oct 2024
R-LLaVA: Improving Med-VQA Understanding through Visual Region of Interest
R-LLaVA: Improving Med-VQA Understanding through Visual Region of Interest
Xupeng Chen
Zhixin Lai
Kangrui Ruan
Shichu Chen
Jiaxiang Liu
Zuozhu Liu
56
1
0
27 Oct 2024
Neural Fields in Robotics: A Survey
Neural Fields in Robotics: A Survey
Muhammad Zubair Irshad
Mauro Comi
Yen-Chen Lin
Nick Heppert
Abhinav Valada
Rares Andrei Ambrus
Z. Kira
Jonathan Tremblay
AI4CE
68
4
0
26 Oct 2024
Improving Multimodal Large Language Models Using Continual Learning
Improving Multimodal Large Language Models Using Continual Learning
Shikhar Srivastava
Md Yousuf Harun
Robik Shrestha
Christopher Kanan
KELM
VLM
CLL
45
1
0
25 Oct 2024
FLAASH: Flow-Attention Adaptive Semantic Hierarchical Fusion for
  Multi-Modal Tobacco Content Analysis
FLAASH: Flow-Attention Adaptive Semantic Hierarchical Fusion for Multi-Modal Tobacco Content Analysis
N. V. R. Chappa
P. Dobbs
Bhiksha Raj
Khoa Luu
70
3
0
25 Oct 2024
MMDocBench: Benchmarking Large Vision-Language Models for Fine-Grained
  Visual Document Understanding
MMDocBench: Benchmarking Large Vision-Language Models for Fine-Grained Visual Document Understanding
Fengbin Zhu
Ziyang Liu
Xiang Yao Ng
Haohui Wu
Wenjie Wang
Fuli Feng
Chao Wang
Huanbo Luan
Tat-Seng Chua
VLM
51
3
0
25 Oct 2024
Revealing and Reducing Gender Biases in Vision and Language Assistants (VLAs)
Revealing and Reducing Gender Biases in Vision and Language Assistants (VLAs)
Leander Girrbach
Yiran Huang
Stephan Alaniz
Trevor Darrell
Zeynep Akata
VLM
56
2
0
25 Oct 2024
GeoLLaVA: Efficient Fine-Tuned Vision-Language Models for Temporal Change Detection in Remote Sensing
GeoLLaVA: Efficient Fine-Tuned Vision-Language Models for Temporal Change Detection in Remote Sensing
Hosam Elgendy
Ahmed Sharshar
Ahmed Aboeitta
Yasser Ashraf
Mohsen Guizani
40
2
0
25 Oct 2024
Watermarking Large Language Models and the Generated Content:
  Opportunities and Challenges
Watermarking Large Language Models and the Generated Content: Opportunities and Challenges
Ruisi Zhang
F. Koushanfar
WaLM
58
0
0
24 Oct 2024
BIFRÖST: 3D-Aware Image compositing with Language Instructions
BIFRÖST: 3D-Aware Image compositing with Language Instructions
Lingxiao Li
Kaixiong Gong
Weihong Li
Xili Dai
Tao Chen
Xiaojun Yuan
Xiangyu Yue
60
2
0
24 Oct 2024
CAMEL-Bench: A Comprehensive Arabic LMM Benchmark
CAMEL-Bench: A Comprehensive Arabic LMM Benchmark
Sara Ghaboura
Ahmed Heakl
Omkar Thawakar
Ali Alharthi
Ines Riahi
Abduljalil Saif
Jorma T. Laaksonen
Fahad Shahbaz Khan
Salman Khan
Rao Muhammad Anwer
58
1
0
24 Oct 2024
OSCAR: Operating System Control via State-Aware Reasoning and
  Re-Planning
OSCAR: Operating System Control via State-Aware Reasoning and Re-Planning
Xiaoqiang Wang
Bang Liu
LLMAG
LM&Ro
LRM
70
9
0
24 Oct 2024
SegLLM: Multi-round Reasoning Segmentation
SegLLM: Multi-round Reasoning Segmentation
XuDong Wang
Shaolun Zhang
Shufan Li
Konstantinos Kallidromitis
Kehan Li
Yusuke Kato
Kazuki Kozuka
Trevor Darrell
VLM
LRM
58
2
0
24 Oct 2024
Diff-Instruct++: Training One-step Text-to-image Generator Model to
  Align with Human Preferences
Diff-Instruct++: Training One-step Text-to-image Generator Model to Align with Human Preferences
Weijian Luo
EGVM
48
6
0
24 Oct 2024
ChatSearch: a Dataset and a Generative Retrieval Model for General
  Conversational Image Retrieval
ChatSearch: a Dataset and a Generative Retrieval Model for General Conversational Image Retrieval
Zijia Zhao
Longteng Guo
Tongtian Yue
Erdong Hu
Shuai Shao
Zehuan Yuan
Hua Huang
Qingbin Liu
38
1
0
24 Oct 2024
DreamClear: High-Capacity Real-World Image Restoration with Privacy-Safe
  Dataset Curation
DreamClear: High-Capacity Real-World Image Restoration with Privacy-Safe Dataset Curation
Yuang Ai
Xiaoqiang Zhou
Huaibo Huang
Xiaotian Han
Zhengyu Chen
Quanzeng You
Hongxia Yang
67
9
0
24 Oct 2024
TripCast: Pre-training of Masked 2D Transformers for Trip Time Series
  Forecasting
TripCast: Pre-training of Masked 2D Transformers for Trip Time Series Forecasting
Yuhua Liao
Zetian Wang
Peng Wei
Qiangqiang Nie
Zhenhua Zhang
AI4TS
23
0
0
24 Oct 2024
VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks
VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks
Lawrence Jang
Yinheng Li
Charles Ding
Justin Lin
Paul Pu Liang
Dan Zhao
Rogerio Bonatti
K. Koishida
51
6
0
24 Oct 2024
Parameter-Efficient Fine-Tuning in Large Models: A Survey of Methodologies
Parameter-Efficient Fine-Tuning in Large Models: A Survey of Methodologies
Liwen Wang
Sheng Chen
Linnan Jiang
Shu Pan
Runze Cai
Sen Yang
Fei Yang
54
5
0
24 Oct 2024
WorldSimBench: Towards Video Generation Models as World Simulators
WorldSimBench: Towards Video Generation Models as World Simulators
Yiran Qin
Zhelun Shi
Jiwen Yu
Xijun Wang
Enshen Zhou
...
Lu Sheng
Jing Shao
Junlin Wu
Wanli Ouyang
Ruimao Zhang
EGVM
VGen
139
402
0
23 Oct 2024
TP-Eval: Tap Multimodal LLMs' Potential in Evaluation by Customizing
  Prompts
TP-Eval: Tap Multimodal LLMs' Potential in Evaluation by Customizing Prompts
Yuxuan Xie
Tianhua Li
Wenqi Shao
Kai Zhang
40
0
0
23 Oct 2024
Scalable Ranked Preference Optimization for Text-to-Image Generation
Scalable Ranked Preference Optimization for Text-to-Image Generation
Shyamgopal Karthik
Huseyin Coskun
Zeynep Akata
Sergey Tulyakov
J. Ren
Anil Kag
EGVM
57
6
0
23 Oct 2024
ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language
  Tuning
ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning
Zhiwei Hao
Jianyuan Guo
Li Shen
Yong Luo
Han Hu
Yonggang Wen
VLM
61
0
0
23 Oct 2024
MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large
  Vision-Language Models
MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models
Ziyu Liu
Yuhang Zang
Xiaoyi Dong
Pan Zhang
Yuhang Cao
Haodong Duan
Zeang Sheng
Yuanjun Xiong
Dahua Lin
Jiaqi Wang
68
8
0
23 Oct 2024
An Intelligent Agentic System for Complex Image Restoration Problems
An Intelligent Agentic System for Complex Image Restoration Problems
Kaiwen Zhu
Jinjin Gu
Zhiyuan You
Yu Qiao
Chao Dong
53
6
0
23 Oct 2024
AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models
AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models
Kim Sung-Bin
Oh Hyun-Bin
JungMok Lee
Arda Senocak
Joon Son Chung
Tae-Hyun Oh
MLLM
VLM
69
4
0
23 Oct 2024
Theorem-Validated Reverse Chain-of-Thought Problem Generation for Geometric Reasoning
Theorem-Validated Reverse Chain-of-Thought Problem Generation for Geometric Reasoning
Linger Deng
Linghao Zhu
Bohan Li
Dongliang Luo
Liang Wu
Chengquan Zhang
Pengyuan Lyu
Ziyang Zhang
Gang Zhang
ReLM
3DV
LRM
62
13
0
23 Oct 2024
Foundation Models for Rapid Autonomy Validation
Foundation Models for Rapid Autonomy Validation
Alec Farid
Peter Schleede
Aaron Huang
Christoffer Heckman
52
0
0
22 Oct 2024
VistaDream: Sampling multiview consistent images for single-view scene
  reconstruction
VistaDream: Sampling multiview consistent images for single-view scene reconstruction
Haiping Wang
Yuan Liu
Ziwei Liu
Wenping Wang
Z. Dong
Bisheng Yang
61
12
0
22 Oct 2024
Previous
123...242526...656667
Next