ResearchTrend.AI
© 2025 ResearchTrend.AI, All rights reserved.
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

30 January 2023
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
    VLM MLLM

Papers citing "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models"

50 / 2,350 papers shown
Title
Generating CAD Code with Vision-Language Models for 3D Designs
Kamel Alrashedy
Pradyumna Tambwekar
Z. Zaidi
Megan Langwasser
Wei Xu
Matthew Gombolay
101
13
0
07 Oct 2024
Geometric Analysis of Reasoning Trajectories: A Phase Space Approach to Understanding Valid and Invalid Multi-Hop Reasoning in LLMs
Javier Marin
LRM
143
0
0
06 Oct 2024
TANGO: Co-Speech Gesture Video Reenactment with Hierarchical Audio Motion Embedding and Diffusion Interpolation
Haiyang Liu
Xingchao Yang
Tomoya Akiyama
Yuantian Huang
Qiaoge Li
Shigeru Kuriyama
Takafumi Taketomi
VGen SLR
75
10
0
05 Oct 2024
Solution for OOD-CV UNICORN Challenge 2024 Object Detection Assistance LLM Counting Ability Improvement
Zhouyang Chi
Qingyuan Jiang
Yang Yang
28
0
0
05 Oct 2024
Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models
Haibo Wang
Zhiyang Xu
Yu Cheng
Shizhe Diao
Yufan Zhou
Yixin Cao
Qifan Wang
Weifeng Ge
Lifu Huang
91
26
0
04 Oct 2024
SPARTUN3D: Situated Spatial Understanding of 3D World in Large Language Models
Yue Zhang
Zhiyang Xu
Ying Shen
Parisa Kordjamshidi
Lifu Huang
131
8
0
04 Oct 2024
Frame-Voyager: Learning to Query Frames for Video Large Language Models
Sicheng Yu
Chengkai Jin
Huanyu Wang
Zhenghao Chen
Sheng Jin
...
Zhenbang Sun
Bingni Zhang
Jiawei Wu
Hao Zhang
Qianru Sun
171
9
0
04 Oct 2024
Structure-Enhanced Protein Instruction Tuning: Towards General-Purpose Protein Understanding with LLMs
Wei Wu
Chao Wang
L. Chen
Mingze Yin
Yiheng Zhu
Kun Fu
Jieping Ye
Hui Xiong
Zheng Wang
147
1
0
04 Oct 2024
Look Twice Before You Answer: Memory-Space Visual Retracing for Hallucination Mitigation in Multimodal Large Language Models
Xin Zou
Yizhou Wang
Yibo Yan
Yuanhuiyi Lyu
Kening Zheng
...
Junkai Chen
Peijie Jiang
Qingbin Liu
Chang Tang
Xuming Hu
171
8
0
04 Oct 2024
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark
Wenhao Chai
Enxin Song
Y. Du
Chenlin Meng
Vashisht Madhavan
Omer Bar-Tal
Jeng-Neng Hwang
Saining Xie
Christopher D. Manning
3DV
219
37
0
04 Oct 2024
RSA: Resolving Scale Ambiguities in Monocular Depth Estimators through Language Descriptions
Ziyao Zeng
Yangchao Wu
Hyoungseob Park
Daniel Wang
Fengyu Yang
Stefano Soatto
Dong Lao
Byung-Woo Hong
Alex Wong
MDE
101
7
0
03 Oct 2024
DivScene: Benchmarking LVLMs for Object Navigation with Diverse Scenes and Objects
Zhaowei Wang
Hongming Zhang
Tianqing Fang
Ye Tian
Yue Yang
Kaixin Ma
Xiaoman Pan
Yangqiu Song
Dong Yu
LM&Ro
110
3
0
03 Oct 2024
Video Instruction Tuning With Synthetic Data
Yuanhan Zhang
Jinming Wu
Wei Li
Bo Li
Zejun Ma
Ziwei Liu
Chunyuan Li
SyDa VGen
120
215
0
03 Oct 2024
Distilling an End-to-End Voice Assistant Without Instruction Training Data
William B. Held
Ella Li
Michael Joseph Ryan
Weiyan Shi
Yanzhe Zhang
Diyi Yang
AuLLM
87
16
0
03 Oct 2024
NL-Eye: Abductive NLI for Images
Mor Ventura
Michael Toker
Nitay Calderon
Zorik Gekhman
Yonatan Bitton
Roi Reichart
83
1
0
03 Oct 2024
Interpreting and Editing Vision-Language Representations to Mitigate Hallucinations
Nick Jiang
Anish Kachinthaya
Suzie Petryk
Yossi Gandelsman
VLM
121
28
0
03 Oct 2024
EXGRA-MED: Extended Context Graph Alignment for Medical Vision-Language Models
Duy Minh Ho Nguyen
Nghiem Tuong Diep
Trung Quoc Nguyen
Hoang-Bao Le
Tai Nguyen
...
P. Xie
Roger Wattenhofer
James Zhou
Daniel Sonntag
Mathias Niepert
VLM
139
4
0
03 Oct 2024
DisEnvisioner: Disentangled and Enriched Visual Prompt for Customized Image Generation
Jing He
Haodong Li
Yongzhe Hu
Guibao Shen
Yingjie Cai
Weichao Qiu
Ying-Cong Chen
DiffM
97
4
0
02 Oct 2024
SegEarth-OV: Towards Training-Free Open-Vocabulary Segmentation for Remote Sensing Images
Kaiyu Li
Ruixun Liu
Xiangyong Cao
Deyu Meng
Zhi Wang
81
3
0
02 Oct 2024
Robo-MUTUAL: Robotic Multimodal Task Specification via Unimodal Learning
Jianxiong Li
Zhihao Wang
Jinliang Zheng
Xiaoai Zhou
Guanming Wang
...
Yu Liu
Jingjing Liu
Ya-Qin Zhang
Junzhi Yu
Xianyuan Zhan
78
2
0
02 Oct 2024
OCC-MLLM-Alpha:Empowering Multi-modal Large Language Model for the Understanding of Occluded Objects with Self-Supervised Test-Time Learning
Shuxin Yang
Xinhan Di
57
1
0
02 Oct 2024
OCC-MLLM:Empowering Multimodal Large Language Model For the Understanding of Occluded Objects
Wenmo Qiu
Xinhan Di
VLM
84
2
0
02 Oct 2024
UAL-Bench: The First Comprehensive Unusual Activity Localization Benchmark
Hasnat Md Abdullah
Tian Liu
Kangda Wei
Shu Kong
Ruihong Huang
85
4
0
02 Oct 2024
EMMA: Efficient Visual Alignment in Multi-Modal LLMs
Sara Ghazanfari
Alexandre Araujo
Prashanth Krishnamurthy
Siddharth Garg
Farshad Khorrami
VLM
83
2
0
02 Oct 2024
Enhancing Screen Time Identification in Children with a Multi-View Vision Language Model and Screen Time Tracker
Xinlong Hou
Sen Shen
Xueshen Li
Xinran Gao
Ziyi Huang
Steven J. Holiday
Matthew R. Cribbet
Susan W. White
Edward Sazonov
Yu Gan
128
0
0
02 Oct 2024
Backdooring Vision-Language Models with Out-Of-Distribution Data
Weimin Lyu
Jiachen Yao
Saumya Gupta
Lu Pang
Tao Sun
Lingjie Yi
Lijie Hu
Haibin Ling
Chao Chen
VLM AAML
144
8
0
02 Oct 2024
Removing Distributional Discrepancies in Captions Improves Image-Text Alignment
Yuheng Li
Haotian Liu
Mu Cai
Yijun Li
Eli Shechtman
Zhe Lin
Yong Jae Lee
Krishna Kumar Singh
VLM
417
4
0
01 Oct 2024
A Hitchhikers Guide to Fine-Grained Face Forgery Detection Using Common Sense Reasoning
Niki Maria Foteinopoulou
Enjie Ghorbel
Djamila Aouada
136
4
0
01 Oct 2024
Scene Graph Disentanglement and Composition for Generalizable Complex Image Generation
Yunnan Wang
Ziqiang Li
Zequn Zhang
Wenyao Zhang
Baao Xie
Xihui Liu
Wenjun Zeng
Xin Jin
CoGe DiffM
68
3
0
01 Oct 2024
Ask, Pose, Unite: Scaling Data Acquisition for Close Interactions with Vision Language Models
Laura Bravo Sánchez
Jaewoo Heo
Zhenzhen Weng
Kuan-Chieh Wang
Serena Yeung-Levy
3DH
99
0
0
01 Oct 2024
Find Everything: A General Vision Language Model Approach to Multi-Object Search
Daniel Choi
Angus Fung
Haitong Wang
Aaron Hao Tan
119
3
0
01 Oct 2024
Probing Mechanical Reasoning in Large Vision Language Models
Haoran Sun
Qingying Gao
Haiyun Lyu
Dezhi Luo
Yijiang Li
Hokin Deng
LRM
114
2
0
01 Oct 2024
Vision Language Models See What You Want but not What You See
Qingying Gao
Yijiang Li
Haiyun Lyu
Haoran Sun
Dezhi Luo
Hokin Deng
LRM VLM
135
5
0
01 Oct 2024
Vision Language Models Know Law of Conservation without Understanding More-or-Less
Dezhi Luo
Haiyun Lyu
Qingying Gao
Haoran Sun
Yijiang Li
Hokin Deng
71
1
0
01 Oct 2024
DreamStruct: Understanding Slides and User Interfaces via Synthetic Data Generation
Yi-Hao Peng
Faria Huq
Yue Jiang
Jason Wu
Amanda Li
Jeffrey P. Bigham
Amy Pavel
DiffM
84
5
0
30 Sep 2024
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning
Haotian Zhang
Mingfei Gao
Zhe Gan
Philipp Dufter
Nina Wenzel
...
Haoxuan You
Zirui Wang
Afshin Dehghan
Peter Grasch
Yinfei Yang
VLM MLLM
133
41
1
30 Sep 2024
Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers
Lirui Wang
Xinlei Chen
Jialiang Zhao
Kaiming He
73
44
0
30 Sep 2024
PerCo (SD): Open Perceptual Compression
Nikolai Korber
Eduard Kromer
Andreas Siebert
S. Hauke
Daniel Mueller-Gritschneder
Björn Schuller
71
5
0
30 Sep 2024
VMAD: Visual-enhanced Multimodal Large Language Model for Zero-Shot Anomaly Detection
Huilin Deng
Hongchen Luo
Wei Zhai
Yang Cao
Yu Kang
79
2
0
30 Sep 2024
Visual Context Window Extension: A New Perspective for Long Video Understanding
Hongchen Wei
Zhenzhong Chen
VLM
88
6
0
30 Sep 2024
Towards Unified Multimodal Editing with Enhanced Knowledge Collaboration
Kaihang Pan
Zhaoyu Fan
Juncheng Li
Qifan Yu
Hao Fei
Siliang Tang
Richang Hong
Hanwang Zhang
Qianru Sun
KELM
111
10
0
30 Sep 2024
Textual Training for the Hassle-Free Removal of Unwanted Visual Data: Case Studies on OOD and Hateful Image Detection
Saehyung Lee
J. Mok
Sangha Park
Yongho Shin
Dahuin Jung
Sungroh Yoon
87
0
0
30 Sep 2024
UniAff: A Unified Representation of Affordances for Tool Usage and Articulation with Vision-Language Models
Qiaojun Yu
Siyuan Huang
Xibin Yuan
Zhengkai Jiang
Ce Hao
...
Junbo Wang
Liu Liu
Hongsheng Li
Peng Gao
Cewu Lu
137
3
0
30 Sep 2024
SSR: Alignment-Aware Modality Connector for Speech Language Models
Weiting Tan
Hirofumi Inaguma
Ning Dong
Paden Tomasello
Xutai Ma
128
6
0
30 Sep 2024
Procedure-Aware Surgical Video-language Pretraining with Hierarchical Knowledge Augmentation
Kun Yuan
V. Srivastav
Nassir Navab
N. Padoy
140
9
0
30 Sep 2024
T2Vs Meet VLMs: A Scalable Multimodal Dataset for Visual Harmfulness Recognition
Chen Yeh
You-Ming Chang
Wei-Chen Chiu
Ning Yu
66
2
0
29 Sep 2024
One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos
Zechen Bai
Tong He
Haiyang Mei
Pichao Wang
Ziteng Gao
Joya Chen
Lei Liu
Zheng Zhang
Mike Zheng Shou
VLM VOS MLLM
91
27
0
29 Sep 2024
RoboNurse-VLA: Robotic Scrub Nurse System based on Vision-Language-Action Model
Shunlei Li
Jin Wang
Rui Dai
Wanyu Ma
Wing Yin Ng
Yingbai Hu
Zheng Li
37
3
0
29 Sep 2024
Video DataFlywheel: Resolving the Impossible Data Trinity in Video-Language Understanding
Xiao Wang
Jianlong Wu
Zijia Lin
Fuzheng Zhang
Di Zhang
Liqiang Nie
VGen
71
3
0
29 Sep 2024
DENEB: A Hallucination-Robust Automatic Evaluation Metric for Image Captioning
Kazuki Matsuda
Yuiga Wada
Komei Sugiura
61
1
0
28 Sep 2024