ResearchTrend.AI

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

20 April 2023
Deyao Zhu
Jun Chen
Xiaoqian Shen
Xiang Li
Mohamed Elhoseiny
    VLM
    MLLM

Papers citing "MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models"

50 / 361 papers shown
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
Shaolei Zhang
Qingkai Fang
Zhe Yang
Yang Feng
MLLM
VLM
69
25
0
07 Jan 2025
Instruction-Guided Scene Text Recognition
Yongkun Du
Z. Chen
Yuchen Su
Caiyan Jia
Yu-Gang Jiang
75
3
0
03 Jan 2025
VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks
Jiannan Wu
Muyan Zhong
Sen Xing
Zeqiang Lai
Zhaoyang Liu
...
Lewei Lu
Tong Lu
Ping Luo
Yu Qiao
Jifeng Dai
MLLM
VLM
LRM
102
48
0
03 Jan 2025
Image-based Multimodal Models as Intruders: Transferable Multimodal Attacks on Video-based MLLMs
Linhao Huang
Xue Jiang
Zhiqiang Wang
Wentao Mo
Xi Xiao
Bo Han
Yongjie Yin
Feng Zheng
AAML
53
2
0
02 Jan 2025
GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models
Zhangyang Qi
Zhixiong Zhang
Ye Fang
Jiaqi Wang
Hengshuang Zhao
83
6
0
02 Jan 2025
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
Wenqi Zhang
Hang Zhang
Xin Li
Jiashuo Sun
Yongliang Shen
Weiming Lu
Deli Zhao
Yueting Zhuang
Lidong Bing
VLM
43
2
0
01 Jan 2025
A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames
Pinelopi Papalampidi
Skanda Koppula
Shreya Pathak
Justin T Chiu
Joseph Heyward
Viorica Patraucean
Jiajun Shen
Antoine Miech
Andrew Zisserman
Aida Nematzadeh
VLM
63
24
0
31 Dec 2024
In-Context Learning with Iterative Demonstration Selection
Chengwei Qin
Aston Zhang
Cheng Chen
Anirudh Dagar
Wenming Ye
LRM
70
38
0
31 Dec 2024
HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding
Chenxin Tao
Shiqian Su
X. Zhu
Chenyu Zhang
Zhe Chen
...
Wenhai Wang
Lewei Lu
Gao Huang
Yu Qiao
Jifeng Dai
MLLM
VLM
104
2
0
20 Dec 2024
Nullu: Mitigating Object Hallucinations in Large Vision-Language Models via HalluSpace Projection
Le Yang
Ziwei Zheng
Boxu Chen
Zhengyu Zhao
Chenhao Lin
Chao Shen
VLM
140
3
0
18 Dec 2024
Empowering LLMs to Understand and Generate Complex Vector Graphics
Ximing Xing
Juncheng Hu
Guotao Liang
Jing Zhang
Dong Xu
Qian Yu
94
7
0
15 Dec 2024
Olympus: A Universal Task Router for Computer Vision Tasks
Yuanze Lin
Yunsheng Li
Dongdong Chen
Weijian Xu
Ronald Clark
Philip Torr
VLM
ObjD
200
0
0
12 Dec 2024
MMedPO: Aligning Medical Vision-Language Models with Clinical-Aware Multimodal Preference Optimization
Kangyu Zhu
Peng Xia
Yun-Qing Li
Hongtu Zhu
Sheng Wang
Huaxiu Yao
103
1
0
09 Dec 2024
EgoPlan-Bench2: A Benchmark for Multimodal Large Language Model Planning in Real-World Scenarios
Lu Qiu
Yuying Ge
Yi Chen
Yixiao Ge
Ying Shan
Xihui Liu
LLMAG
LRM
98
5
0
05 Dec 2024
AdvDreamer Unveils: Are Vision-Language Models Truly Ready for Real-World 3D Variations?
Shouwei Ruan
Hanqin Liu
Yao Huang
Xiaoqi Wang
Caixin Kang
Hang Su
Yinpeng Dong
Xingxing Wei
VGen
93
0
0
04 Dec 2024
DynamicControl: Adaptive Condition Selection for Improved Text-to-Image Generation
Q. He
Jinlong Peng
P. Xu
Boyuan Jiang
Xiaobin Hu
...
Yong Liu
Yitong Wang
Chengjie Wang
Xiaomeng Li
Jianwei Zhang
DiffM
122
1
0
04 Dec 2024
SeqAfford: Sequential 3D Affordance Reasoning via Multimodal Large Language Model
Chunlin Yu
Hanqing Wang
Ye Shi
Haoyang Luo
Sibei Yang
Jingyi Yu
Jingya Wang
LRM
LM&Ro
94
1
0
02 Dec 2024
COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training
Sanghwan Kim
Rui Xiao
Mariana-Iuliana Georgescu
Stephan Alaniz
Zeynep Akata
VLM
85
2
0
02 Dec 2024
ROSE: Revolutionizing Open-Set Dense Segmentation with Patch-Wise Perceptual Large Multimodal Model
Kunyang Han
Yibo Hu
Mengxue Qu
Hailin Shi
Yao Zhao
Y. X. Wei
MLLM
VLM
3DV
88
1
0
29 Nov 2024
GREAT: Geometry-Intention Collaborative Inference for Open-Vocabulary 3D Object Affordance Grounding
Yawen Shao
Wei-dong Zhai
Yuhang Yang
Hongchen Luo
Yang Cao
Zheng-jun Zha
98
1
0
29 Nov 2024
Orthus: Autoregressive Interleaved Image-Text Generation with Modality-Specific Heads
Siqi Kou
Jiachun Jin
Chang Liu
Ye Ma
Jian Jia
Quan Chen
Peng Jiang
Zhijie Deng
DiffM
VGen
VLM
135
6
0
28 Nov 2024
Libra: Leveraging Temporal Images for Biomedical Radiology Analysis
Xi Zhang
Zaiqiao Meng
Jake Lever
Edmond S. L. Ho
MedIm
96
0
0
28 Nov 2024
Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning
Di Zhang
Jingdi Lei
Junxian Li
Xunzhi Wang
Yong Liu
...
Steve Yang
Jianbo Wu
Peng Ye
Wanli Ouyang
Dongzhan Zhou
OffRL
LRM
107
6
0
27 Nov 2024
SAMWISE: Infusing Wisdom in SAM2 for Text-Driven Video Segmentation
Claudia Cuttano
Gabriele Trivigno
Gabriele Rosi
Carlo Masone
Giuseppe Averta
VOS
106
2
0
26 Nov 2024
GEMeX: A Large-Scale, Groundable, and Explainable Medical VQA Benchmark for Chest X-ray Diagnosis
Bo Liu
K. Zou
Liming Zhan
Zexin Lu
Xiaoyu Dong
Yidi Chen
Chengqiang Xie
Jiannong Cao
Xiao-Ming Wu
Huazhu Fu
122
0
0
25 Nov 2024
VideoOrion: Tokenizing Object Dynamics in Videos
Yicheng Feng
Yijiang Li
Wanpeng Zhang
Sipeng Zheng
Zongqing Lu
109
1
0
25 Nov 2024
Is 'Right' Right? Enhancing Object Orientation Understanding in Multimodal Large Language Models through Egocentric Instruction Tuning
Ji Hyeok Jung
Eun Tae Kim
S. Kim
Joo Ho Lee
Bumsoo Kim
Buru Chang
VLM
191
0
0
24 Nov 2024
Lifelong Knowledge Editing for Vision Language Models with Low-Rank Mixture-of-Experts
Qizhou Chen
Chengyu Wang
Dakan Wang
Taolin Zhang
Wangyue Li
Xiaofeng He
KELM
83
1
0
23 Nov 2024
On the Consistency of Video Large Language Models in Temporal Comprehension
Minjoon Jung
Junbin Xiao
Byoung-Tak Zhang
Angela Yao
87
2
0
20 Nov 2024
Teaching VLMs to Localize Specific Objects from In-context Examples
Sivan Doveh
Nimrod Shabtay
Wei Lin
Eli Schwartz
Hilde Kuehne
...
Leonid Karlinsky
James Glass
Assaf Arbelle
S. Ullman
Muhammad Jehanzeb Mirza
VLM
103
1
0
20 Nov 2024
Spider: Any-to-Many Multimodal LLM
Jinxiang Lai
Jie Zhang
Jun Liu
Jian Li
Xiaocheng Lu
Song Guo
MLLM
69
2
0
14 Nov 2024
Exploring Hierarchical Molecular Graph Representation in Multimodal LLMs
Chengxin Hu
Hao Li
Yihe Yuan
Jing Li
Ivor Tsang
46
0
0
07 Nov 2024
CAD-MLLM: Unifying Multimodality-Conditioned CAD Generation With MLLM
Jingwei Xu
Chenyu Wang
Zibo Zhao
Wen Liu
Yi Ma
Shenghua Gao
55
13
0
07 Nov 2024
VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos
Shehan Munasinghe
Hanan Gani
Wenqi Zhu
Jiale Cao
Eric P. Xing
Fahad Shahbaz Khan
Salman Khan
MLLM
VGen
VLM
44
6
0
07 Nov 2024
Performance evaluation of SLAM-ASR: The Good, the Bad, the Ugly, and the Way Forward
Shashi Kumar
Iuliia Thorbecke
Sergio Burdisso
Esaú Villatoro-Tello
Marcelo Errecalde
Kadri Hacioğlu
Pradeep Rangappa
P. Motlícek
A. Ganapathiraju
Andreas Stolcke
55
1
0
06 Nov 2024
One VLM to Keep it Learning: Generation and Balancing for Data-free Continual Visual Question Answering
Deepayan Das
Davide Talon
Massimiliano Mancini
Yiming Wang
Elisa Ricci
41
0
0
04 Nov 2024
UniGuard: Towards Universal Safety Guardrails for Jailbreak Attacks on Multimodal Large Language Models
Sejoon Oh
Yiqiao Jin
Megha Sharma
Donghyun Kim
Eric Ma
Gaurav Verma
Srijan Kumar
65
6
0
03 Nov 2024
On Learning Multi-Modal Forgery Representation for Diffusion Generated Video Detection
Xiufeng Song
Xiao Guo
J. Zhang
Qirui Li
Lei Bai
Xiaoming Liu
Guangtao Zhai
Xiaohong Liu
DiffM
VGen
71
9
0
31 Oct 2024
AutoBench-V: Can Large Vision-Language Models Benchmark Themselves?
Han Bao
Yue Huang
Yanbo Wang
Jiayi Ye
Xiangqi Wang
Xiuying Chen
Mohamed Elhoseiny
Xuzhi Zhang
Xiangliang Zhang
47
7
0
28 Oct 2024
Revealing and Reducing Gender Biases in Vision and Language Assistants (VLAs)
Leander Girrbach
Yiran Huang
Stephan Alaniz
Trevor Darrell
Zeynep Akata
VLM
47
2
0
25 Oct 2024
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
Long Xing
Qidong Huang
Xiaoyi Dong
Jiajie Lu
Pan Zhang
...
Yuhang Cao
Conghui He
Jiaqi Wang
Feng Wu
Dahua Lin
VLM
48
26
0
22 Oct 2024
Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation
Hyungjoo Chae
Namyoung Kim
Kai Tzu-iunn Ong
Minju Gwak
Gwanwoo Song
Jihoon Kim
S. Kim
Dongha Lee
Jinyoung Yeo
LLMAG
33
14
0
17 Oct 2024
TransAgent: Transfer Vision-Language Foundation Models with Heterogeneous Agent Collaboration
Yiwei Guo
Shaobin Zhuang
Kunchang Li
Yu Qiao
Yali Wang
VLM
CLIP
35
0
0
16 Oct 2024
Ctrl-U: Robust Conditional Image Generation via Uncertainty-aware Reward Modeling
Guiyu Zhang
Huan-ang Gao
Zijian Jiang
Hao Zhao
Zhedong Zheng
EGVM
52
6
0
15 Oct 2024
3DArticCyclists: Generating Synthetic Articulated 8D Pose-Controllable Cyclist Data for Computer Vision Applications
Eduardo R. Corral-Soto
Yang Liu
Tongtong Cao
Y. Ren
Liu Bingbing
55
5
0
14 Oct 2024
MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models
Wenbo Hu
Jia-Chen Gu
Zi-Yi Dou
Mohsen Fayyaz
Pan Lu
Kai-Wei Chang
Nanyun Peng
VLM
66
4
0
10 Oct 2024
MM-Ego: Towards Building Egocentric Multimodal LLMs for Video QA
Hanrong Ye
Haotian Zhang
Erik Daxberger
Lin Chen
Zongyu Lin
...
Haoxuan You
Dan Xu
Zhe Gan
Jiasen Lu
Yinfei Yang
EgoV
MLLM
88
12
0
09 Oct 2024
TRACE: Temporal Grounding Video LLM via Causal Event Modeling
Yongxin Guo
Jingyu Liu
Mingda Li
Xiaoying Tang
Qingbin Liu
39
14
0
08 Oct 2024
GLOV: Guided Large Language Models as Implicit Optimizers for Vision Language Models
Muhammad Jehanzeb Mirza
Mengjie Zhao
Zhuoyuan Mao
Sivan Doveh
Wei Lin
...
Yuki Mitsufuji
Horst Possegger
Rogerio Feris
Leonid Karlinsky
James Glass
VLM
84
1
0
08 Oct 2024
Geometric Analysis of Reasoning Trajectories: A Phase Space Approach to Understanding Valid and Invalid Multi-Hop Reasoning in LLMs
Javier Marin
LRM
85
0
0
06 Oct 2024