2205.01917
CoCa: Contrastive Captioners are Image-Text Foundation Models
4 May 2022
Jiahui Yu
Zirui Wang
Vijay Vasudevan
Legg Yeung
Mojtaba Seyedhosseini
Yonghui Wu
VLM
CLIP
OffRL
Papers citing
"CoCa: Contrastive Captioners are Image-Text Foundation Models"
50 / 910 papers shown
Which Viewpoint Shows it Best? Language for Weakly Supervising View Selection in Multi-view Instructional Videos
Sagnik Majumder
Tushar Nagarajan
Ziad Al-Halah
Reina Pradhan
Kristen Grauman
31
0
0
13 Nov 2024
Classification Done Right for Vision-Language Pre-Training
Zilong Huang
Qinghao Ye
Bingyi Kang
Jiashi Feng
Haoqi Fan
CLIP
VLM
50
2
0
05 Nov 2024
Domain Expansion and Boundary Growth for Open-Set Single-Source Domain Generalization
Pengkun Jiao
Na Zhao
Jingjing Chen
Yu-Gang Jiang
OOD
41
0
0
05 Nov 2024
INQUIRE: A Natural World Text-to-Image Retrieval Benchmark
Edward Vendrow
Omiros Pantazis
Alexander Shepard
Gabriel J. Brostow
Kate E. Jones
Oisin Mac Aodha
Sara Beery
Grant Van Horn
VLM
40
3
0
04 Nov 2024
Driving by the Rules: A Benchmark for Integrating Traffic Sign Regulations into Vectorized HD Map
Xinyuan Chang
Maixuan Xue
Xinran Liu
Zheng Pan
Xing Wei
62
1
0
31 Oct 2024
EMMA: End-to-End Multimodal Model for Autonomous Driving
Jyh-Jing Hwang
Runsheng Xu
Hubert Lin
Wei-Chih Hung
Jingwei Ji
...
Benjamin Sapp
Yin Zhou
James Guo
Dragomir Anguelov
Mingxing Tan
VLM
LM&Ro
46
28
0
30 Oct 2024
AlphaChimp: Tracking and Behavior Recognition of Chimpanzees
Xiaoxuan Ma
Yutang Lin
Yuan Xu
Stephan P. Kaufhold
Jack Terwilliger
Andres Meza
Yixin Zhu
Federico Rossano
Yizhou Wang
36
0
0
22 Oct 2024
xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs
Michael S Ryoo
Honglu Zhou
Shrikant B. Kendre
Can Qin
Le Xue
Manli Shu
Silvio Savarese
Ran Xu
Caiming Xiong
Juan Carlos Niebles
VGen
43
13
0
21 Oct 2024
Beyond Filtering: Adaptive Image-Text Quality Enhancement for MLLM Pretraining
Han Huang
Yuqi Huo
Zijia Zhao
Haoyu Lu
Shu Wu
Bin Wang
Qiang Liu
Weipeng Chen
Liang Wang
VLM
30
1
0
21 Oct 2024
TIPS: Text-Image Pretraining with Spatial awareness
Kevis-Kokitsi Maninis
Kaifeng Chen
Soham Ghosh
Arjun Karpur
Koert Chen
...
Jan Dlabal
Dan Gnanapragasam
Mojtaba Seyedhosseini
Howard Zhou
Andre Araujo
VLM
35
3
0
21 Oct 2024
Assistive AI for Augmenting Human Decision-making
Natabara Máté Gyöngyössy
Bernát Török
Csilla Farkas
Laura Lucaj
Attila Menyhárd
Krisztina Menyhárd-Balázs
András Simonyi
Patrick van der Smagt
Zsolt Ződi
András Lőrincz
39
0
0
18 Oct 2024
Dual Prototype Evolving for Test-Time Generalization of Vision-Language Models
Ce Zhang
Simon Stepputtis
Katia P. Sycara
Yaqi Xie
VLM
37
5
0
16 Oct 2024
DRACO: A Denoising-Reconstruction Autoencoder for Cryo-EM
Yingjun Shen
Haizhao Dai
Qihe Chen
Yan Zeng
Jiakai Zhang
Yuan Pei
Jingyi Yu
24
0
0
15 Oct 2024
Locality Alignment Improves Vision-Language Models
Ian Covert
Tony Sun
James Zou
Tatsunori Hashimoto
VLM
70
4
0
14 Oct 2024
Mamba4Cast: Efficient Zero-Shot Time Series Forecasting with State Space Models
Sathya Kamesh Bhethanabhotla
Omar Swelam
Julien N. Siems
David Salinas
Frank Hutter
Mamba
AI4TS
AI4CE
43
4
0
12 Oct 2024
Calibrated Cache Model for Few-Shot Vision-Language Model Adaptation
Kun Ding
Qiang Yu
Haojian Zhang
Gaofeng Meng
Shiming Xiang
VLM
30
0
0
11 Oct 2024
On a Hidden Property in Computational Imaging
Yinan Feng
Yinpeng Chen
Yueh Lee
Youzuo Lin
28
0
0
11 Oct 2024
LatteCLIP: Unsupervised CLIP Fine-Tuning via LMM-Synthetic Texts
Anh-Quan Cao
M. Jaritz
Matthieu Guillaumin
Raoul de Charette
Loris Bazzani
VLM
CLIP
52
2
0
10 Oct 2024
Evaluating Computational Pathology Foundation Models for Prostate Cancer Grading under Distribution Shifts
Fredrik K. Gustafsson
Mattias Rantalainen
OOD
MedIm
29
0
0
09 Oct 2024
Deep Correlated Prompting for Visual Recognition with Missing Modalities
Lianyu Hu
Tongkai Shi
Wei Feng
Fanhua Shang
Liang Wan
VLM
31
1
0
09 Oct 2024
TuneVLSeg: Prompt Tuning Benchmark for Vision-Language Segmentation Models
Rabin Adhikari
Safal Thapaliya
Manish Dhakal
Bishesh Khanal
MLLM
VLM
35
0
0
07 Oct 2024
Uncertainty-Guided Enhancement on Driving Perception System via Foundation Models
Yunhao Yang
Yuxin Hu
Mao Ye
Zaiwei Zhang
Zhichao Lu
Yi Xu
Ufuk Topcu
Ben Snyder
26
2
0
02 Oct 2024
Advancing Medical Radiograph Representation Learning: A Hybrid Pre-training Paradigm with Multilevel Semantic Granularity
Hanqi Jiang
Xixuan Hao
Yuzhou Huang
Chong Ma
Jiaxun Zhang
Yi Pan
Ruimao Zhang
MedIm
37
0
0
01 Oct 2024
Procedure-Aware Surgical Video-language Pretraining with Hierarchical Knowledge Augmentation
Kun Yuan
V. Srivastav
Nassir Navab
N. Padoy
44
7
0
30 Sep 2024
FAST: A Dual-tier Few-Shot Learning Paradigm for Whole Slide Image Classification
Kexue Fu
Xiaoyuan Luo
Linhao Qu
Shuo Wang
Ying Xiong
Ilias Maglogiannis
Longxiang Gao
Manning Wang
31
1
0
29 Sep 2024
Vision-Language Models are Strong Noisy Label Detectors
Tong Wei
Hao Li
Chun-Shu Li
Jiang-Xin Shi
Yu-Feng Li
Min-Ling Zhang
VLM
34
3
0
29 Sep 2024
From Vision to Audio and Beyond: A Unified Model for Audio-Visual Representation and Generation
Kun Su
Xiulong Liu
Eli Shlizerman
VGen
30
6
0
27 Sep 2024
ViKL: A Mammography Interpretation Framework via Multimodal Aggregation of Visual-knowledge-linguistic Features
Xin Wei
Yaling Tao
Changde Du
Gangming Zhao
Yizhou Yu
Jinpeng Li
33
0
0
24 Sep 2024
LARE: Latent Augmentation using Regional Embedding with Vision-Language Model
Kosuke Sakurai
Tatsuya Ishii
Ryotaro Shimizu
Linxin Song
Masayuki Goto
VLM
26
0
0
19 Sep 2024
From Linguistic Giants to Sensory Maestros: A Survey on Cross-Modal Reasoning with Large Language Models
Shengsheng Qian
Zuyi Zhou
Dizhan Xue
Bing Wang
Changsheng Xu
LRM
36
1
0
19 Sep 2024
MoRAG -- Multi-Fusion Retrieval Augmented Generation for Human Motion
Kalakonda Sai Shashank
Shubh Maheshwari
Ravi Kiran Sarvadevabhatla
VGen
DiffM
27
1
0
18 Sep 2024
Evaluating Pre-trained Convolutional Neural Networks and Foundation Models as Feature Extractors for Content-based Medical Image Retrieval
A. Mahbod
Nematollah Saeidi
Sepideh Hatamikia
Ramona Woitek
VLM
MedIm
31
2
0
14 Sep 2024
Phikon-v2, A large and public feature extractor for biomarker prediction
Alexandre Filiot
Paul Jacob
Alice Mac Kain
Charlie Saillard
MedIm
39
17
0
13 Sep 2024
ComAlign: Compositional Alignment in Vision-Language Models
Ali Abdollah
Amirmohammad Izadi
Armin Saghafian
Reza Vahidimajd
Mohammad Mozafari
Amirreza Mirzaei
Mohammadmahdi Samiei
M. Baghshah
CoGe
VLM
30
0
0
12 Sep 2024
Recent Trends of Multimodal Affective Computing: A Survey from NLP Perspective
Guimin Hu
Yi Xin
Weimin Lyu
Haojian Huang
Chang Sun
Zehan Zhu
Lin Gui
Ruichu Cai
Erik Cambria
Hasti Seifi
32
5
0
11 Sep 2024
CanvOI, an Oncology Intelligence Foundation Model: Scaling FLOPS Differently
Jonathan Zalach
Inbal Gazy
Assaf Avinoam
Ron Sinai
Eran Shmuel
Inbar Gilboa
Christine Swisher
Naim Matasci
Reva Basho
David B. Agus
40
0
0
04 Sep 2024
No Detail Left Behind: Revisiting Self-Retrieval for Fine-Grained Image Captioning
Manu Gaur
Darshan Singh
Makarand Tapaswi
115
1
0
04 Sep 2024
Rethinking Sparse Lexical Representations for Image Retrieval in the Age of Rising Multi-Modal Large Language Models
K. Nakata
Daisuke Miyashita
Youyang Ng
Yasuto Hoshi
J. Deguchi
39
0
0
29 Aug 2024
RSTeller: Scaling Up Visual Language Modeling in Remote Sensing with Rich Linguistic Semantics from Openly Available Data and Large Language Models
Junyao Ge
Yang Zheng
Kaitai Guo
Jimin Liang
38
1
0
27 Aug 2024
A New Era in Computational Pathology: A Survey on Foundation and Vision-Language Models
Dibaloke Chanda
Milan Aryal
Nasim Yahya Soltani
Masoud Ganji
AI4CE
VLM
44
7
0
23 Aug 2024
Has Multimodal Learning Delivered Universal Intelligence in Healthcare? A Comprehensive Survey
Qika Lin
Yifan Zhu
Xin Mei
Ling Huang
Jingying Ma
Kai He
Zhen Peng
Erik Cambria
Mengling Feng
42
18
0
23 Aug 2024
XDT-CXR: Investigating Cross-Disease Transferability in Zero-Shot Binary Classification of Chest X-Rays
Umaima Rahman
Abhishek Basu
Muhammad Uzair Khattak
Aniq Ur Rahman
MedIm
41
0
0
21 Aug 2024
WRIM-Net: Wide-Ranging Information Mining Network for Visible-Infrared Person Re-Identification
Yonggan Wu
Ling-Chao Meng
Yuan Zichao
Sixian Chan
Hong-Qiang Wang
37
3
0
20 Aug 2024
C²RL: Content and Context Representation Learning for Gloss-free Sign Language Translation and Retrieval
Zhigang Chen
Benjia Zhou
Yiqing Huang
Jun Wan
Yibo Hu
Hailin Shi
Yanyan Liang
Zhen Lei
Du Zhang
VLM
SLR
40
1
0
19 Aug 2024
NAVERO: Unlocking Fine-Grained Semantics for Video-Language Compositionality
Chaofan Tao
Gukyeong Kwon
Varad Gunjal
Hao Yang
Zhaowei Cai
Yonatan Dukler
Ashwin Swaminathan
R. Manmatha
Colin Jon Taylor
Stefano Soatto
CoGe
37
0
0
18 Aug 2024
CROME: Cross-Modal Adapters for Efficient Multimodal LLM
Sayna Ebrahimi
Sercan Ö. Arik
Tejas Nama
Tomas Pfister
44
1
0
13 Aug 2024
Contrastive masked auto-encoders based self-supervised hashing for 2D image and 3D point cloud cross-modal retrieval
Rukai Wei
Heng Cui
Yu Liu
Yufeng Hou
Yanzhao Xie
Ke Zhou
3DPC
29
0
0
11 Aug 2024
In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation
Dahyun Kang
Minsu Cho
ObjD
VLM
40
9
0
09 Aug 2024
UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling
Haider Al-Tahan
Q. Garrido
Randall Balestriero
Diane Bouchacourt
C. Hazirbas
Mark Ibrahim
VLM
76
10
0
09 Aug 2024
ArtVLM: Attribute Recognition Through Vision-Based Prefix Language Modeling
William Y. Zhu
Keren Ye
Junjie Ke
Jiahui Yu
Leonidas J. Guibas
P. Milanfar
Feng Yang
45
2
0
07 Aug 2024