Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2006.06666
Cited By
VirTex: Learning Visual Representations from Textual Annotations
11 June 2020
Karan Desai
Justin Johnson
SSL
VLM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"VirTex: Learning Visual Representations from Textual Annotations"
50 / 107 papers shown
Title
Boosting Single-domain Generalized Object Detection via Vision-Language Knowledge Interaction
Xiaoran Xu
Jiangang Yang
Wenyue Chong
Wenhui Shi
S.
Jing Xing
Jian Liu
ObjD
VLM
81
0
0
27 Apr 2025
Perception Encoder: The best visual embeddings are not at the output of the network
Daniel Bolya
Po-Yao (Bernie) Huang
Peize Sun
Jang Hyun Cho
Andrea Madotto
...
Shiyu Dong
Nikhila Ravi
Daniel Li
Piotr Dollár
Christoph Feichtenhofer
ObjD
VOS
103
0
0
17 Apr 2025
A Comprehensive Survey of Foundation Models in Medicine
Wasif Khan
Seowung Leem
Kyle B. See
Joshua K. Wong
Shaoting Zhang
R. Fang
AI4CE
LM&MA
VLM
105
18
0
17 Jan 2025
CLIP-PING: Boosting Lightweight Vision-Language Models with Proximus Intrinsic Neighbors Guidance
Chu Myaet Thwal
Ye Lin Tun
Minh N. H. Nguyen
Eui-nam Huh
Choong Seon Hong
VLM
74
0
0
05 Dec 2024
Sparse Attention Vectors: Generative Multimodal Model Features Are Discriminative Vision-Language Classifiers
Chancharik Mitra
Brandon Huang
Tianning Chai
Zhiqiu Lin
Assaf Arbelle
Rogerio Feris
Leonid Karlinsky
Trevor Darrell
Deva Ramanan
Roei Herzig
VLM
123
4
0
28 Nov 2024
Learning from Convolution-based Unlearnable Datasets
Dohyun Kim
Pedro Sandoval-Segura
MU
91
1
0
04 Nov 2024
A new approach for encoding code and assisting code understanding
Mengdan Fan
Changde Du
Haiyan Zhao
Zhi Jin
46
0
0
01 Aug 2024
What Do You See? Enhancing Zero-Shot Image Classification with Multimodal Large Language Models
Abdelrahman Abdelhamed
Mahmoud Afifi
Alec Go
MLLM
VLM
33
3
0
24 May 2024
GOOD: Towards Domain Generalized Orientated Object Detection
Qi Bi
Beichen Zhou
Jingjun Yi
Wei Ji
Haolan Zhan
Gui-Song Xia
ObjD
OOD
76
2
0
20 Feb 2024
POP-3D: Open-Vocabulary 3D Occupancy Prediction from Images
Antonín Vobecký
Oriane Siméoni
David Hurych
Spyros Gidaris
Andrei Bursuc
Patrick Pérez
Josef Sivic
40
33
0
17 Jan 2024
From Text to Pixels: A Context-Aware Semantic Synergy Solution for Infrared and Visible Image Fusion
Xingyuan Li
Yang Zou
Jinyuan Liu
Zhiying Jiang
Long Ma
Xin-Yue Fan
Risheng Liu
40
4
0
31 Dec 2023
SkyScript: A Large and Semantically Diverse Vision-Language Dataset for Remote Sensing
Zhecheng Wang
R. Prabha
Tianyuan Huang
Jiajun Wu
Ram Rajagopal
34
53
0
20 Dec 2023
Bootstrap Fine-Grained Vision-Language Alignment for Unified Zero-Shot Anomaly Localization
Hanqiu Deng
Zhaoxiang Zhang
Jinan Bao
Xingyu Li
VLM
27
4
0
30 Aug 2023
Multimodal Foundation Models For Echocardiogram Interpretation
M. Christensen
Milos Vukadinovic
N. Yuan
David Ouyang
MedIm
21
7
0
29 Aug 2023
RECLIP: Resource-efficient CLIP by Training with Small Images
Runze Li
Dahun Kim
B. Bhanu
Weicheng Kuo
VLM
CLIP
28
12
0
12 Apr 2023
Learning Transferable Pedestrian Representation from Multimodal Information Supervision
Li-Na Bao
Longhui Wei
Xiaoyu Qiu
Wen-gang Zhou
Houqiang Li
Qi Tian
SSL
18
5
0
12 Apr 2023
SPAN: Learning Similarity between Scene Graphs and Images with Transformers
Yuren Cong
Wentong Liao
Bodo Rosenhahn
M. Yang
35
6
0
02 Apr 2023
CUDA: Convolution-based Unlearnable Datasets
Vinu Sankar Sadasivan
Mahdi Soltanolkotabi
S. Feizi
MU
29
25
0
07 Mar 2023
CleanCLIP: Mitigating Data Poisoning Attacks in Multimodal Contrastive Learning
Hritik Bansal
Nishad Singhi
Yu Yang
Fan Yin
Aditya Grover
Kai-Wei Chang
AAML
34
42
0
06 Mar 2023
Learning Visual Representations via Language-Guided Sampling
Mohamed El Banani
Karan Desai
Justin Johnson
SSL
VLM
13
28
0
23 Feb 2023
MINOTAUR: Multi-task Video Grounding From Multimodal Queries
Raghav Goyal
E. Mavroudi
Xitong Yang
Sainbayar Sukhbaatar
Leonid Sigal
Matt Feiszli
Lorenzo Torresani
Du Tran
14
7
0
16 Feb 2023
Actional Atomic-Concept Learning for Demystifying Vision-Language Navigation
Bingqian Lin
Yi Zhu
Xiaodan Liang
Liang Lin
Jian-zhuo Liu
CoGe
LM&Ro
35
3
0
13 Feb 2023
Advancing Radiograph Representation Learning with Masked Record Modeling
Hong-Yu Zhou
Chenyu Lian
Lian-cheng Wang
Yizhou Yu
MedIm
36
55
0
30 Jan 2023
Vision Learners Meet Web Image-Text Pairs
Bingchen Zhao
Quan Cui
Hao Wu
Osamu Yoshie
Cheng Yang
Oisin Mac Aodha
VLM
24
5
0
17 Jan 2023
RILS: Masked Visual Reconstruction in Language Semantic Space
Shusheng Yang
Yixiao Ge
Kun Yi
Dian Li
Ying Shan
Xiaohu Qie
Xinggang Wang
CLIP
43
11
0
17 Jan 2023
EXIF as Language: Learning Cross-Modal Associations Between Images and Camera Metadata
Chenhao Zheng
Ayush Shrivastava
Andrew Owens
VLM
28
11
0
11 Jan 2023
Learning 3D Representations from 2D Pre-trained Models via Image-to-Point Masked Autoencoders
Renrui Zhang
Liuhui Wang
Yu Qiao
Peng Gao
Hongsheng Li
3DPC
35
124
0
13 Dec 2022
Using Multiple Instance Learning to Build Multimodal Representations
Peiqi Wang
W. Wells
Seth Berkowitz
Steven Horng
Polina Golland
SSL
24
6
0
11 Dec 2022
Scaling Language-Image Pre-training via Masking
Yanghao Li
Haoqi Fan
Ronghang Hu
Christoph Feichtenhofer
Kaiming He
CLIP
VLM
27
318
0
01 Dec 2022
Improving Cross-Modal Retrieval with Set of Diverse Embeddings
Dongwon Kim
Nam-Won Kim
Suha Kwak
24
37
0
30 Nov 2022
Peekaboo: Text to Image Diffusion Models are Zero-Shot Segmentors
R. Burgert
Kanchana Ranasinghe
Xiang Li
Michael S. Ryoo
DiffM
VLM
27
37
0
23 Nov 2022
The Role of Local Alignment and Uniformity in Image-Text Contrastive Learning on Medical Images
Philip Muller
Georgios Kaissis
Daniel Rueckert
MedIm
16
7
0
14 Nov 2022
Unsupervised Audio-Visual Lecture Segmentation
Darshan Singh
Anchit Gupta
C. V. Jawahar
Makarand Tapaswi
VOS
16
4
0
29 Oct 2022
MovieCLIP: Visual Scene Recognition in Movies
Digbalay Bose
Rajat Hebbar
Krishna Somandepalli
Haoyang Zhang
Yin Cui
K. Cole-McLaughlin
H. Wang
Shrikanth Narayanan
CLIP
12
20
0
20 Oct 2022
VTC: Improving Video-Text Retrieval with User Comments
Laura Hanu
James Thewlis
Yuki M. Asano
Christian Rupprecht
VGen
21
7
0
19 Oct 2022
MedCLIP: Contrastive Learning from Unpaired Medical Images and Text
Zifeng Wang
Zhenbang Wu
Dinesh Agarwal
Jimeng Sun
CLIP
VLM
MedIm
29
397
0
18 Oct 2022
FETA: Towards Specializing Foundation Models for Expert Task Applications
Amit Alfassy
Assaf Arbelle
Oshri Halimi
Sivan Harary
Roei Herzig
...
Christoph Auer
Kate Saenko
Peter W. J. Staar
Rogerio Feris
Leonid Karlinsky
21
19
0
08 Sep 2022
MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining
Xiaoyi Dong
Jianmin Bao
Yinglin Zheng
Ting Zhang
Dongdong Chen
...
Weiming Zhang
Lu Yuan
Dong Chen
Fang Wen
Nenghai Yu
CLIP
VLM
40
157
0
25 Aug 2022
RadTex: Learning Efficient Radiograph Representations from Text Reports
Keegan Quigley
Miriam Cha
Ruizhi Liao
Geeticka Chauhan
Steven Horng
Seth Berkowitz
Polina Golland
MedIm
22
3
0
05 Aug 2022
Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training
Haoxuan You
Luowei Zhou
Bin Xiao
Noel Codella
Yu Cheng
Ruochen Xu
Shih-Fu Chang
Lu Yuan
CLIP
VLM
21
48
0
26 Jul 2022
Towards the Human Global Context: Does the Vision-Language Model Really Judge Like a Human Being?
Sangmyeong Woh
Jaemin Lee
Hoki Kim
Jinsuk Lee
21
0
0
18 Jul 2022
Contrastive Adapters for Foundation Model Group Robustness
Michael Zhang
Christopher Ré
VLM
10
61
0
14 Jul 2022
e-CLIP: Large-Scale Vision-Language Representation Learning in E-commerce
Wonyoung Shin
Jonghun Park
Taekang Woo
Yongwoo Cho
Kwangjin Oh
Hwanjun Song
VLM
27
16
0
01 Jul 2022
Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone
Zi-Yi Dou
Aishwarya Kamath
Zhe Gan
Pengchuan Zhang
Jianfeng Wang
...
Ce Liu
Yann LeCun
Nanyun Peng
Jianfeng Gao
Lijuan Wang
VLM
ObjD
21
124
0
15 Jun 2022
Discovering Object Masks with Transformers for Unsupervised Semantic Segmentation
Wouter Van Gansbeke
Simon Vandenhende
Luc Van Gool
36
55
0
13 Jun 2022
INDIGO: Intrinsic Multimodality for Domain Generalization
Puneet Mangla
Shivam Chandhok
Milan Aggarwal
V. Balasubramanian
Balaji Krishnamurthy
VLM
35
2
0
13 Jun 2022
CyCLIP: Cyclic Contrastive Language-Image Pretraining
Shashank Goel
Hritik Bansal
S. Bhatia
Ryan A. Rossi
Vishwa Vinay
Aditya Grover
CLIP
VLM
173
132
0
28 May 2022
Learning to Answer Visual Questions from Web Videos
Antoine Yang
Antoine Miech
Josef Sivic
Ivan Laptev
Cordelia Schmid
ViT
34
33
0
10 May 2022
P3IV: Probabilistic Procedure Planning from Instructional Videos with Weak Supervision
Henghui Zhao
Isma Hadji
Nikita Dvornik
Konstantinos G. Derpanis
Richard P. Wildes
Allan D. Jepson
26
45
0
04 May 2022
Data Determines Distributional Robustness in Contrastive Language Image Pre-training (CLIP)
Alex Fang
Gabriel Ilharco
Mitchell Wortsman
Yu Wan
Vaishaal Shankar
Achal Dave
Ludwig Schmidt
VLM
OOD
31
138
0
03 May 2022
1
2
3
Next