ResearchTrend.AI
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

30 January 2023
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
    VLM, MLLM

Papers citing "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models"

50 / 2,340 papers shown
From What to How: Attributing CLIP's Latent Components Reveals Unexpected Semantic Reliance
Maximilian Dreyer
Lorenz Hufe
J. Berend
Thomas Wiegand
Sebastian Lapuschkin
Wojciech Samek
42
0
0
26 May 2025
MLLM-Guided VLM Fine-Tuning with Joint Inference for Zero-Shot Composed Image Retrieval
Rong-Cheng Tu
Zhao Jin
Jingyi Liao
Xiao Luo
Yingjie Wang
Li Shen
Dacheng Tao
115
0
0
26 May 2025
From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data
Chun-Yi Kuan
Hung-yi Lee
AuLLM
79
0
0
26 May 2025
Deformable Attentive Visual Enhancement for Referring Segmentation Using Vision-Language Model
Alaa Dalaq
Muzammil Behzad
VLM
198
0
0
25 May 2025
Improving Medical Reasoning with Curriculum-Aware Reinforcement Learning
Shaohao Rui
Kaitao Chen
Weijie Ma
Xiaosong Wang
OffRL, LRM
20
0
0
25 May 2025
Jodi: Unification of Visual Generation and Understanding via Joint Modeling
Yifeng Xu
Zhenliang He
Meina Kan
Shiguang Shan
Xilin Chen
VLM
83
0
0
25 May 2025
Using Large Language Models to Tackle Fundamental Challenges in Graph Learning: A Comprehensive Survey
Mengran Li
Pengyu Zhang
Wenbin Xing
Yijia Zheng
Klim Zaporojets
...
Jia Hu
Xiaolei Ma
Zhiyuan Liu
Paul Groth
Marcel Worring
AI4CE
146
0
0
24 May 2025
Flex-Judge: Think Once, Judge Anywhere
Jongwoo Ko
S. Kim
Sungwoo Cho
Se-Young Yun
ELM, LRM
218
0
0
24 May 2025
ToDRE: Visual Token Pruning via Diversity and Task Awareness for Efficient Large Vision-Language Models
Duo Li
Zuhao Yang
Shijian Lu
VLM
96
0
0
24 May 2025
LiSTEN: Learning Soft Token Embeddings for Neural Audio LLMs
Pooneh Mousavi
Shubham Gupta
Cem Subakan
Mirco Ravanelli
51
0
0
24 May 2025
Cultural Awareness in Vision-Language Models: A Cross-Country Exploration
Avinash Madasu
Vasudev Lal
Phillip Howard
VLM
22
0
0
23 May 2025
Decoupled Visual Interpretation and Linguistic Reasoning for Math Problem Solving
Zixian Guo
Ming-Yu Liu
Zhilong Ji
Jinfeng Bai
Lei Zhang
W. Zuo
LRM, VLM
94
0
0
23 May 2025
Co-Reinforcement Learning for Unified Multimodal Understanding and Generation
Jingjing Jiang
Chongjie Si
Jun Luo
Hanwang Zhang
Chao Ma
186
0
0
23 May 2025
DetailFusion: A Dual-branch Framework with Detail Enhancement for Composed Image Retrieval
Yuxin Yang
Yinan Zhou
Yuxin Chen
Ziqi Zhang
Zongyang Ma
...
Bing Li
Lin Song
Jun Gao
Peng Li
Weiming Hu
199
0
0
23 May 2025
Learning Shared Representations from Unpaired Data
Amitai Yacobi
Nir Ben-Ari
Ronen Talmon
Uri Shaham
SSL
80
0
0
23 May 2025
The Coherence Trap: When MLLM-Crafted Narratives Exploit Manipulated Visual Contexts
Yuchen Zhang
Yaxiong Wang
Yujiao Wu
Lianwei Wu
Li Zhu
AAML
107
0
0
23 May 2025
Multi-task Learning For Joint Action and Gesture Recognition
Konstantinos Spathis
N. Kardaris
Petros Maragos
35
0
0
23 May 2025
HoloLLM: Multisensory Foundation Model for Language-Grounded Human Sensing and Reasoning
Chuhao Zhou
Jianfei Yang
VLM
250
0
0
23 May 2025
Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities
Ziwei Zhou
Rui Wang
Zuxuan Wu
AuLLM, VGen
80
0
0
23 May 2025
Instructify: Demystifying Metadata to Visual Instruction Tuning Data Conversion
Jacob A. Hansen
Wei Lin
Junmo Kang
M. Jehanzeb Mirza
Hongyin Luo
Rogerio Feris
Alan Ritter
James R. Glass
Leonid Karlinsky
VLM
241
0
0
23 May 2025
Slot-MLLM: Object-Centric Visual Tokenization for Multimodal LLM
Donghwan Chi
Hyomin Kim
Yoonjin Oh
Yongjin Kim
Donghoon Lee
DaeJin Jo
Jongmin Kim
Junyeob Baek
Sungjin Ahn
Sungwoong Kim
MLLM, VLM
482
0
0
23 May 2025
Analyzing Fine-Grained Alignment and Enhancing Vision Understanding in Multimodal Language Models
Jiachen Jiang
Jinxin Zhou
Bo Peng
Xia Ning
Zhihui Zhu
102
0
0
22 May 2025
From Evaluation to Defense: Advancing Safety in Video Large Language Models
Yiwei Sun
Peiqi Jiang
Chuanbin Liu
Luohao Lin
Zhiying Lu
Hongtao Xie
53
0
0
22 May 2025
One-Step Diffusion-Based Image Compression with Semantic Distillation
Naifu Xue
Zhaoyang Jia
Jiahao Li
Bin Li
Yuan Zhang
Yan Lu
DiffM
121
0
0
22 May 2025
VLM-R$^3$: Region Recognition, Reasoning, and Refinement for Enhanced Multimodal Chain-of-Thought
Chaoya Jiang
Yongrui Heng
Wei Ye
Han Yang
Haiyang Xu
Ming Yan
Ji Zhang
Fei Huang
Shikun Zhang
LRM
73
0
0
22 May 2025
An Empirical Study on Configuring In-Context Learning Demonstrations for Unleashing MLLMs' Sentimental Perception Capability
Daiqing Wu
Dongbao Yang
Sicheng Zhao
Can Ma
Yu Zhou
45
0
0
22 May 2025
NTIRE 2025 challenge on Text to Image Generation Model Quality Assessment
Shuhao Han
Haotian Fan
Fangyuan Kong
Wenjie Liao
Chunle Guo
...
Jian Guo
Zhizhuo Shao
Ziyu Feng
Bing Li
Weiming Hu
190
11
0
22 May 2025
CrossLMM: Decoupling Long Video Sequences from LMMs via Dual Cross-Attention Mechanisms
Shilin Yan
Jiaming Han
Joey Tsai
Hongwei Xue
Rongyao Fang
Lingyi Hong
Ziyu Guo
Ray Zhang
VLM
91
4
0
22 May 2025
Zero-Shot Anomaly Detection in Battery Thermal Images Using Visual Question Answering with Prior Knowledge
Marcella Astrid
Abdelrahman Shabayek
Djamila Aouada
48
0
0
22 May 2025
CHART-6: Human-Centered Evaluation of Data Visualization Understanding in Vision-Language Models
Arnav Verma
Kushin Mukherjee
Christopher Potts
Elisa Kreiss
Judith E. Fan
34
0
0
22 May 2025
Action2Dialogue: Generating Character-Centric Narratives from Scene-Level Prompts
Taewon Kang
Ming C. Lin
DiffM, VGen
83
0
0
22 May 2025
Panoptic Captioning: Seeking An Equivalency Bridge for Image and Text
Kun-Yu Lin
Hongjun Wang
Weining Ren
Kai Han
291
0
0
22 May 2025
VL-SAFE: Vision-Language Guided Safety-Aware Reinforcement Learning with World Models for Autonomous Driving
Yansong Qu
Zilin Huang
Zihao Sheng
Jiancong Chen
Sikai Chen
Samuel Labi
OffRL
68
0
0
22 May 2025
TinyDrive: Multiscale Visual Question Answering with Selective Token Routing for Autonomous Driving
Hossein Hassani
Soodeh Nikan
Abdallah Shami
MLLM
143
0
0
21 May 2025
Learning Interpretable Representations Leads to Semantically Faithful EEG-to-Text Generation
Xiaozhao Liu
Dinggang Shen
Xihui Liu
86
0
0
21 May 2025
ViQAgent: Zero-Shot Video Question Answering via Agent with Open-Vocabulary Grounding Validation
Tony Montes
Fernando Lozano
60
0
0
21 May 2025
Object-Focus Actor for Data-efficient Robot Generalization Dexterous Manipulation
Yihang Li
Tianle Zhang
Xuelong Wei
Jiayi Li
Lin Zhao
Dongchi Huang
Zhirui Fang
Minhua Zheng
Wenjun Dai
Xiaodong He
71
0
0
21 May 2025
OViP: Online Vision-Language Preference Learning
Shujun Liu
Siyuan Wang
Zejun Li
Jianxiang Wang
Cheng Zeng
Zhongyu Wei
MLLM, VLM
76
0
0
21 May 2025
Visual Perturbation and Adaptive Hard Negative Contrastive Learning for Compositional Reasoning in Vision-Language Models
Xin Huang
Ruibin Li
Tong Jia
Wei Zheng
Ya Wang
VLM, CoGe
132
0
0
21 May 2025
Chain-of-Focus: Adaptive Visual Search and Zooming for Multimodal Reasoning via RL
Xintong Zhang
Zhi Gao
Bofei Zhang
Pengxiang Li
Xiaowen Zhang
...
Tao Yuan
Yuwei Wu
Yunde Jia
Song-Chun Zhu
Qing Li
LRM
124
0
0
21 May 2025
RAZER: Robust Accelerated Zero-Shot 3D Open-Vocabulary Panoptic Reconstruction with Spatio-Temporal Aggregation
Naman Patel
Prashanth Krishnamurthy
Farshad Khorrami
82
0
0
21 May 2025
Highlighting What Matters: Promptable Embeddings for Attribute-Focused Image Retrieval
Siting Li
Xiang Gao
Simon Shaolei Du
132
0
0
21 May 2025
SCENIR: Visual Semantic Clarity through Unsupervised Scene Graph Retrieval
Nikolaos Chaidos
Angeliki Dimitriou
Maria Lymperaiou
Giorgos Stamou
67
0
0
21 May 2025
Generative AI for Autonomous Driving: A Review
Katharina Winter
Abhishek Vivekanandan
Rupert Polley
Yinzhe Shen
Christian Schlauch
...
Christian Wirth
Omer Sahin Tas
Nadja Klein
Fabian B. Flohr
Hanno Gottschalk
94
0
0
21 May 2025
Domain Adaptation of VLM for Soccer Video Understanding
Tiancheng Jiang
Henry Wang
Md Sirajus Salekin
Parmida Atighehchian
Shinan Zhang
VLM
98
0
0
20 May 2025
Ground-V: Teaching VLMs to Ground Complex Instructions in Pixels
Yongshuo Zong
Qin Zhang
Dongsheng An
Zhihua Li
Xiang Xu
Linghan Xu
Zhuowen Tu
Yifan Xing
Onkar Dabeer
ObjD
96
0
0
20 May 2025
VoQA: Visual-only Question Answering
Luyang Jiang
Jianing An
Jie Luo
Wenjun Wu
Lei Huang
LRM
101
0
0
20 May 2025
Teaching Audio-Aware Large Language Models What Does Not Hear: Mitigating Hallucinations through Synthesized Negative Samples
Chun-Yi Kuan
Hung-yi Lee
88
1
0
20 May 2025
Scaling Vision Mamba Across Resolutions via Fractal Traversal
Bo Li
Haoke Xiao
Lv Tang
Mamba
126
0
0
20 May 2025
MedBLIP: Fine-tuning BLIP for Medical Image Captioning
Manshi Limbu
Diwita Banerjee
LM&MA, MedIm, VLM
116
0
0
20 May 2025