Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2301.12597
Cited By
v1
v2
v3 (latest)
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
30 January 2023
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLM
MLLM
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models"
50 / 2,345 papers shown
Title
SoftVQ-VAE: Efficient 1-Dimensional Continuous Tokenizer
Hong Chen
Zihan Wang
Xianrui Li
Xingwu Sun
Fangyi Chen
Jiang Liu
Jiadong Wang
Bhiksha Raj
Zicheng Liu
Emad Barsoum
VLM
288
10
0
14 Dec 2024
Towards Unified Benchmark and Models for Multi-Modal Perceptual Metrics
Sara Ghazanfari
Siddharth Garg
Nicolas Flammarion
Prashanth Krishnamurthy
Farshad Khorrami
Francesco Croce
VLM
150
0
0
13 Dec 2024
The Language of Motion: Unifying Verbal and Non-verbal Language of 3D Human Motion
Changan Chen
Juze Zhang
S. K. Lakshmikanth
Yusu Fang
Ruizhi Shao
Gordon Wetzstein
L. Fei-Fei
Ehsan Adeli
VGen
135
5
0
13 Dec 2024
Neptune: The Long Orbit to Benchmarking Long Video Understanding
Arsha Nagrani
Ruotong Wang
Ramin Mehran
Rachel Hornung
N. B. Gundavarapu
...
Boqing Gong
Cordelia Schmid
Mikhail Sirotenko
Yukun Zhu
Tobias Weyand
179
8
0
12 Dec 2024
Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition
Zhisheng Zhong
Chengyao Wang
Yuqi Liu
Senqiao Yang
Longxiang Tang
...
Shaozuo Yu
Sitong Wu
Eric Lo
Shu Liu
Jiaya Jia
AuLLM
162
7
0
12 Dec 2024
Falcon-UI: Understanding GUI Before Following User Instructions
Huawen Shen
Chang-Shu Liu
Gengluo Li
Xinlong Wang
Yu Zhou
Can Ma
Xiangyang Ji
LLMAG
158
9
0
12 Dec 2024
Foundation Models and Adaptive Feature Selection: A Synergistic Approach to Video Question Answering
Sai Bhargav Rongali
M. Cui
Ankit Jha
Neha Bhargava
Saurabh Prasad
Biplab Banerjee
119
0
0
12 Dec 2024
Geo-LLaVA: A Large Multi-Modal Model for Solving Geometry Math Problems with Meta In-Context Learning
Shihao Xu
Yiyang Luo
Wei Shi
LRM
ReLM
123
3
0
12 Dec 2024
Omni-ID: Holistic Identity Representation Designed for Generative Tasks
Guocheng Qian
Kuan-Chieh Wang
Or Patashnik
Negin Heravi
Daniil Ostashev
Sergey Tulyakov
Daniel Cohen-Or
Kfir Aberman
142
5
0
12 Dec 2024
Olympus: A Universal Task Router for Computer Vision Tasks
Yuanze Lin
Yunsheng Li
Dongdong Chen
Weijian Xu
Ronald Clark
Philip Torr
VLM
ObjD
548
1
0
12 Dec 2024
TimeRefine: Temporal Grounding with Time Refining Video LLM
Xizi Wang
Feng Cheng
Ziyang Wang
Huiyu Wang
Md. Mohaiminul Islam
Lorenzo Torresani
Joey Tianyi Zhou
Gedas Bertasius
David J. Crandall
212
2
0
12 Dec 2024
LLaVA-Zip: Adaptive Visual Token Compression with Intrinsic Image Information
Ke Wang
Hong Xuan
VLM
121
2
0
11 Dec 2024
SenCLIP: Enhancing zero-shot land-use mapping for Sentinel-2 with ground-level prompting
Pallavi Jain
Dino Ienco
R. Interdonato
Tristan Berchoux
Diego Marcos
VLM
135
3
0
11 Dec 2024
CAP: Evaluation of Persuasive and Creative Image Generation
Aysan Aghazadeh
Adriana Kovashka
EGVM
166
2
0
10 Dec 2024
StyleMaster: Stylize Your Video with Artistic Generation and Translation
Zixuan Ye
Huijuan Huang
Xintao Wang
Pengfei Wan
Di Zhang
Wenhan Luo
DiffM
VGen
132
6
0
10 Dec 2024
FiVA: Fine-grained Visual Attribute Dataset for Text-to-Image Diffusion Models
Tong Wu
Yinghao Xu
Ryan Po
Mengchen Zhang
Guandao Yang
Jiaqi Wang
Ziqiang Liu
Dahua Lin
Gordon Wetzstein
113
0
0
10 Dec 2024
RoboMM: All-in-One Multimodal Large Model for Robotic Manipulation
Feng Yan
Fanfan Liu
Liming Zheng
Yufeng Zhong
Yiyang Huang
Zechao Guan
Chengjian Feng
Lin Ma
127
3
0
10 Dec 2024
ArtFormer: Controllable Generation of Diverse 3D Articulated Objects
Jiayi Su
Youhe Feng
Zheng Li
Jinhua Song
Yangfan He
Botao Ren
Botian Xu
AI4CE
158
3
0
10 Dec 2024
ContRail: A Framework for Realistic Railway Image Synthesis using ControlNet
Andrei-Robert Alexandrescu
Razvan-Gabriel Petec
Alexandru Manole
Laura-Silvia Diosan
DiffM
112
0
0
09 Dec 2024
LLaVA-SpaceSGG: Visual Instruct Tuning for Open-vocabulary Scene Graph Generation with Enhanced Spatial Relations
Mingjie Xu
Mengyang Wu
Yuzhi Zhao
Jason Chun Lok Li
Weifeng Ou
LRM
SyDa
VLM
129
4
0
09 Dec 2024
Chimera: Improving Generalist Model with Domain-Specific Experts
Tianshuo Peng
Mingxing Li
Hongbin Zhou
Renqiu Xia
Renrui Zhang
...
Aojun Zhou
Botian Shi
Tao Chen
Bo Zhang
Xiangyu Yue
197
5
0
08 Dec 2024
A Self-Learning Multimodal Approach for Fake News Detection
Hao Chen
Hui Guo
Baochen Hu
Shu Hu
Jinrong Hu
Siwei Lyu
Xi Wu
Xinze Wang
117
2
0
08 Dec 2024
Evaluating Hallucination in Text-to-Image Diffusion Models with Scene-Graph based Question-Answering Agent
Ziyuan Qin
D. Cheng
Haoyu Wang
Huahui Yi
Yuting Shao
Zhiyuan Fan
Kang Li
Qicheng Lao
EGVM
MLLM
472
0
0
07 Dec 2024
SQ-Whisper: Speaker-Querying based Whisper Model for Target-Speaker ASR
Pengcheng Guo
Xuankai Chang
Hang Lv
Shinji Watanabe
Lei Xie
111
1
0
07 Dec 2024
TANGO: Training-free Embodied AI Agents for Open-world Tasks
Filippo Ziliotto
Tommaso Campari
Luciano Serafini
Lamberto Ballan
LLMAG
LM&Ro
MLLM
LRM
150
2
0
05 Dec 2024
LossAgent: Towards Any Optimization Objectives for Image Processing with LLM Agents
Bingchen Li
Xin Li
Yiting Lu
Zhibo Chen
194
1
0
05 Dec 2024
EgoPlan-Bench2: A Benchmark for Multimodal Large Language Model Planning in Real-World Scenarios
Lu Qiu
Yuying Ge
Yi Chen
Yixiao Ge
Ying Shan
Xihui Liu
LLMAG
LRM
211
8
0
05 Dec 2024
Composed Image Retrieval for Training-Free Domain Conversion
Nikos Efthymiadis
Bill Psomas
Zakaria Laskar
Konstantinos Karantzalos
Yannis Avrithis
Ondřej Chum
Giorgos Tolias
119
0
0
04 Dec 2024
AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning
Yiwu Zhong
Zhuoming Liu
Yin Li
Liwei Wang
146
7
0
04 Dec 2024
Video LLMs for Temporal Reasoning in Long Videos
Fawad Javed Fateh
Umer Ahmed
Hamza Khan
M. Zia
Quoc-Huy Tran
VLM
186
1
0
04 Dec 2024
AdvDreamer Unveils: Are Vision-Language Models Truly Ready for Real-World 3D Variations?
Shouwei Ruan
Hanqin Liu
Yao Huang
Xiaoqi Wang
Caixin Kang
Hang Su
Yinpeng Dong
Xingxing Wei
VGen
222
0
0
04 Dec 2024
DynamicControl: Adaptive Condition Selection for Improved Text-to-Image Generation
Qu He
Jinlong Peng
P. Xu
Boyuan Jiang
Xiaobin Hu
...
Yang Liu
Yun Wang
Chengjie Wang
Xuelong Li
Jing Zhang
DiffM
215
1
0
04 Dec 2024
Sharp-It: A Multi-view to Multi-view Diffusion Model for 3D Synthesis and Manipulation
Yiftach Edelstein
Or Patashnik
Dana Cohen-Bar
Lihi Zelnik-Manor
138
0
0
03 Dec 2024
AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?
Kaixiong Gong
Kaituo Feng
Yangqiu Song
Yibing Wang
Mofan Cheng
...
Jiaming Han
Benyou Wang
Yutong Bai
Zhiyong Yang
Xiangyu Yue
MLLM
AuLLM
VLM
128
11
0
03 Dec 2024
SJTU:Spatial judgments in multimodal models towards unified segmentation through coordinate detection
Joongwon Chae
Zhenyu Wang
Peiwu Qin
VLM
103
0
0
03 Dec 2024
Composing Open-domain Vision with RAG for Ocean Monitoring and Conservation
Sepand Dyanatkar
Angran Li
Alexander Dungate
104
0
0
03 Dec 2024
AccDiffusion v2: Towards More Accurate Higher-Resolution Diffusion Extrapolation
Zhihang Lin
Mingbao Lin
Wengyi Zhan
Rongrong Ji
138
0
0
03 Dec 2024
Progress-Aware Video Frame Captioning
Zihui Xue
Joungbin An
Xitong Yang
Kristen Grauman
239
1
0
03 Dec 2024
3DSceneEditor: Controllable 3D Scene Editing with Gaussian Splatting
Ziyang Yan
Lei Li
Yihua Shao
Siyu Chen
Wuzong Kai
Lei Li
Hao Zhao
Fabio Remondino
3DGS
168
3
0
02 Dec 2024
VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval
Dhiman Paul
Md Rizwan Parvez
Nabeel Mohammed
Shafin Rahman
VGen
125
0
0
02 Dec 2024
HaGRIDv2: 1M Images for Static and Dynamic Hand Gesture Recognition
Anton Nuzhdin
Alexander Nagaev
Alexander Sautin
A. Kapitanov
Karina Kvanchiani
EgoV
113
0
0
02 Dec 2024
Behavior Backdoor for Deep Learning Models
Jinqiao Wang
Pengfei Zhang
R. Tao
Jian Yang
Hao Liu
Xianglong Liu
Y. X. Wei
Yao Zhao
AAML
124
0
0
02 Dec 2024
Enhancing Perception Capabilities of Multimodal LLMs with Training-Free Fusion
Zhuokun Chen
Jinwu Hu
Zeshuai Deng
Yufeng Wang
Bohan Zhuang
Mingkui Tan
149
0
0
02 Dec 2024
DIR: Retrieval-Augmented Image Captioning with Comprehensive Understanding
Hao Wu
Zhihang Zhong
Xiao Sun
DiffM
113
0
0
02 Dec 2024
IQA-Adapter: Exploring Knowledge Transfer from Image Quality Assessment to Diffusion-based Generative Models
Khaled Abud
Sergey Lavrushkin
Alexey Kirillov
D. Vatolin
216
0
0
02 Dec 2024
SEAL: Semantic Attention Learning for Long Video Representation
Lan Wang
Yujia Chen
Wen-Sheng Chu
Vishnu Boddeti
Du Tran
VLM
248
0
0
02 Dec 2024
OmniGuard: Hybrid Manipulation Localization via Augmented Versatile Deep Image Watermarking
Xinyu Zhang
Zecheng Tang
Zhipei Xu
Runyi Li
Youmin Xu
Bin Chen
Feng Gao
Jian Zhang
WIGM
199
5
0
02 Dec 2024
COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training
Sanghwan Kim
Rui Xiao
Mariana-Iuliana Georgescu
Stephan Alaniz
Zeynep Akata
VLM
351
3
0
02 Dec 2024
LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences
Hongyan Zhi
Peihao Chen
Junyan Li
Shuailei Ma
Xinyu Sun
Tianhang Xiang
Yinjie Lei
Mingkui Tan
Chuang Gan
174
8
0
02 Dec 2024
OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows
Shufan Li
Konstantinos Kallidromitis
Akash Gokul
Zichun Liao
Yusuke Kato
Kazuki Kozuka
Aditya Grover
VGen
177
9
0
02 Dec 2024
Previous
1
2
3
...
17
18
19
...
45
46
47
Next