Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2103.00020
Cited By
Learning Transferable Visual Models From Natural Language Supervision
26 February 2021
Alec Radford
Jong Wook Kim
Chris Hallacy
Aditya A. Ramesh
Gabriel Goh
Sandhini Agarwal
Girish Sastry
Amanda Askell
Pamela Mishkin
Jack Clark
Gretchen Krueger
Ilya Sutskever
CLIP
VLM
Re-assign community
ArXiv (abs)
PDF
HTML
Github (29177★)
Papers citing
"Learning Transferable Visual Models From Natural Language Supervision"
50 / 1,722 papers shown
Title
A Parameter-Efficient Tuning Framework for Language-guided Object Grounding and Robot Grasping
Houjian Yu
Mingen Li
Alireza Rezazadeh
Yang Yang
Changhyun Choi
103
2
0
28 Sep 2024
Search3D: Hierarchical Open-Vocabulary 3D Segmentation
Ayca Takmaz
Alexandros Delitzas
R. Sumner
Francis Engelmann
Johanna Wald
Federico Tombari
156
13
0
27 Sep 2024
Text2FX: Harnessing CLAP Embeddings for Text-Guided Audio Effects
Annie Chu
P. O'Reilly
Julia Barnett
Bryan Pardo
CLIP
109
3
0
27 Sep 2024
MIO: A Foundation Model on Multimodal Tokens
Zekun Wang
King Zhu
Chunpu Xu
Wangchunshu Zhou
Jiaheng Liu
...
Yuanxing Zhang
Ge Zhang
Ke Xu
Jie Fu
Wenhao Huang
MLLM
AuLLM
138
12
0
26 Sep 2024
Multiplicative Logit Adjustment Approximates Neural-Collapse-Aware Decision Boundary Adjustment
Naoya Hasegawa
Issei Sato
93
0
0
26 Sep 2024
Explanation Bottleneck Models
Shinýa Yamaguchi
Kosuke Nishida
LRM
BDL
126
2
0
26 Sep 2024
EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions
Kai Chen
Yunhao Gou
Runhui Huang
Zhili Liu
Daxin Tan
...
Qun Liu
Jun Yao
Lu Hou
Hang Xu
Hang Xu
AuLLM
MLLM
VLM
168
29
0
26 Sep 2024
LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness
Chenming Zhu
Tai Wang
Wenwei Zhang
Jiangmiao Pang
Xihui Liu
228
53
0
26 Sep 2024
JoyType: A Robust Design for Multilingual Visual Text Creation
Chao Li
Chen Jiang
Xiaolong Liu
Jun Zhao
Guoxin Wang
DiffM
107
7
0
26 Sep 2024
Multi-View and Multi-Scale Alignment for Contrastive Language-Image Pre-training in Mammography
Yuexi Du
John Onofrey
Nicha Dvornek
VLM
99
2
0
26 Sep 2024
PACE: Marrying generalization in PArameter-efficient fine-tuning with Consistency rEgularization
Yao Ni
Shan Zhang
Piotr Koniusz
448
8
0
25 Sep 2024
Spacewalker: Traversing Representation Spaces for Fast Interactive Exploration and Annotation of Unstructured Data
Lukas Heine
Fabian Horst
Jana Fragemann
Gijs Luijten
M. Balzer
Jan Egger
F. Bahnsen
M. Sarfraz
Jens Kleesiek
78
0
0
25 Sep 2024
Single Image, Any Face: Generalisable 3D Face Generation
Wenqing Wang
Haosen Yang
Josef Kittler
Xiatian Zhu
3DH
133
0
0
25 Sep 2024
Generative Object Insertion in Gaussian Splatting with a Multi-View Diffusion Model
Hongliang Zhong
Can Wang
Jingbo Zhang
Jing Liao
3DGS
DiffM
78
2
0
25 Sep 2024
GeoBiked: A Dataset with Geometric Features and Automated Labeling Techniques to Enable Deep Generative Models in Engineering Design
Phillip Mueller
Sebastian Mueller
Lars Mikelsons
82
2
0
25 Sep 2024
Ctrl-GenAug: Controllable Generative Augmentation for Medical Sequence Classification
Xinrui Zhou
Yuhao Huang
Haoran Dou
Shijing Chen
Ao Chang
...
Jie Jessie Ren
Ruobing Huang
Jun Cheng
Wufeng Xue
Dong Ni
MedIm
354
0
0
25 Sep 2024
CloudTrack: Scalable UAV Tracking with Cloud Semantics
Yannik Blei
Michael Krawez
Nisarga Nilavadi
Tanja Katharina Kaiser
Wolfram Burgard
86
1
0
24 Sep 2024
Language-based Audio Moment Retrieval
Hokuto Munakata
Taichi Nishimura
Shota Nakada
Tatsuya Komatsu
124
2
0
24 Sep 2024
FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension
Junzhuo Liu
Xiaohu Yang
Weiwei Li
Peng Wang
ObjD
126
5
0
23 Sep 2024
ReVLA: Reverting Visual Domain Limitation of Robotic Foundation Models
Sombit Dey
Jan-Nico Zaech
Nikolay Nikolov
Luc Van Gool
Danda Pani Paudel
MoMe
VLM
111
5
0
23 Sep 2024
TSCLIP: Robust CLIP Fine-Tuning for Worldwide Cross-Regional Traffic Sign Recognition
Guoyang Zhao
Fulong Ma
Weiqing Qi
Chenguang Zhang
Yuxuan Liu
Ming Liu
Jun Ma
VLM
CLIP
388
3
0
23 Sep 2024
DiSPo: Diffusion-SSM based Policy Learning for Coarse-to-Fine Action Discretization
Nayoung Oh
Jaehyeong Jang
Moonkyeong Jung
Daehyung Park
451
0
0
23 Sep 2024
PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions
Weifeng Lin
Xinyu Wei
Renrui Zhang
Le Zhuo
Shitian Zhao
...
Junlin Xie
Junlin Xie
Yu Qiao
Peng Gao
Hongsheng Li
MLLM
DiffM
174
15
0
23 Sep 2024
MediConfusion: Can you trust your AI radiologist? Probing the reliability of multimodal medical foundation models
Mohammad Shahab Sepehri
Zalan Fabian
Maryam Soltanolkotabi
Mahdi Soltanolkotabi
MedIm
123
6
0
23 Sep 2024
OmniBench: Towards The Future of Universal Omni-Language Models
Yizhi Li
Ge Zhang
Yinghao Ma
Ruibin Yuan
Kang Zhu
...
Zhaoxiang Zhang
Zachary Liu
Emmanouil Benetos
Wenhao Huang
Chenghua Lin
LRM
124
19
0
23 Sep 2024
Dormant: Defending against Pose-driven Human Image Animation
Jiachen Zhou
Mingsi Wang
Tianlin Li
Guozhu Meng
Kai Chen
152
4
0
22 Sep 2024
Semi-intrusive audio evaluation: Casting non-intrusive assessment as a multi-modal text prediction task
Jozef Coldenhoff
Milos Cernak
92
0
0
21 Sep 2024
Relevance-driven Decision Making for Safer and More Efficient Human Robot Collaboration
Xiaotong Zhang
Dingcheng Huang
Kamal Youcef-Toumi
67
2
0
21 Sep 2024
PointSAM: Pointly-Supervised Segment Anything Model for Remote Sensing Images
Nanqing Liu
Xun Xu
Yongyi Su
Haojie Zhang
Heng-Chao Li
VLM
115
15
0
20 Sep 2024
OLiVia-Nav: An Online Lifelong Vision Language Approach for Mobile Robot Social Navigation
Siddarth Narasimhan
Aaron Hao Tan
Daniel Choi
G. Nejat
LM&Ro
120
4
0
20 Sep 2024
MaPPER: Multimodal Prior-guided Parameter Efficient Tuning for Referring Expression Comprehension
Ting Liu
Zunnan Xu
Yue Hu
Liangtao Shi
Zhiqiang Wang
Quanjun Yin
118
3
0
20 Sep 2024
Generating Visual Stories with Grounded and Coreferent Characters
Danyang Liu
Mirella Lapata
Frank Keller
100
2
0
20 Sep 2024
JourneyBench: A Challenging One-Stop Vision-Language Understanding Benchmark of Generated Images
Zhecan Wang
Junzhang Liu
Chia-Wei Tang
Hani Alomari
Anushka Sivakumar
...
Haoxuan You
A. Ishmam
Kai-Wei Chang
Shih-Fu Chang
Chris Thomas
CoGe
VLM
161
2
0
19 Sep 2024
Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution
Zuyan Liu
Yuhao Dong
Ziwei Liu
Winston Hu
Jiwen Lu
Yongming Rao
ObjD
178
72
0
19 Sep 2024
3DTopia-XL: Scaling High-quality 3D Asset Generation via Primitive Diffusion
Zhaoxi Chen
Jiaxiang Tang
Yuhao Dong
Ziang Cao
Fangzhou Hong
...
Tong Wu
Shunsuke Saito
Liang Pan
Dahua Lin
Ziwei Liu
112
23
0
19 Sep 2024
End-to-end Open-vocabulary Video Visual Relationship Detection using Multi-modal Prompting
Yongqi Wang
Xinxiao Wu
Shuo Yang
Jiebo Luo
454
1
0
19 Sep 2024
Towards Global Localization using Multi-Modal Object-Instance Re-Identification
Aneesh Chavan
Vaibhav Agrawal
Vineeth Bhat
Sarthak Chittawar
Siddharth Srivastava
Chetan Arora
K. M. Krishna
140
0
0
18 Sep 2024
Mamba Fusion: Learning Actions Through Questioning
Zhikang Dong
Apoorva Beedu
Jason Sheinkopf
Irfan Essa
Mamba
132
2
0
17 Sep 2024
HiFi-CS: Towards Open Vocabulary Visual Grounding For Robotic Grasping Using Vision-Language Models
V. Bhat
Prashanth Krishnamurthy
Ramesh Karri
Farshad Khorrami
134
5
0
16 Sep 2024
EditBoard: Towards a Comprehensive Evaluation Benchmark for Text-Based Video Editing Models
Yupeng Chen
Penglin Chen
Xiaoyu Zhang
Yixian Huang
Qian Xie
DiffM
85
1
0
15 Sep 2024
Generative Semantic Communication via Textual Prompts: Latency Performance Tradeoffs
Mengmeng Ren
Li Qiao
Long Yang
Zhen Gao
Jian Chen
Mahdi Boloursaz Mashhadi
Pei Xiao
Rahim Tafazolli
Mehdi Bennis
VLM
137
5
0
15 Sep 2024
One missing piece in Vision and Language: A Survey on Comics Understanding
Emanuele Vivoli
Andrey Barsky
Mohamed Ali Souibgui
Artemis LLabres
Marco Bertini
Dimosthenis Karatzas
104
5
0
14 Sep 2024
Interpretable Vision-Language Survival Analysis with Ordinal Inductive Bias for Computational Pathology
Pei Liu
Luping Ji
Jiaxiang Gou
Bo Fu
Mao Ye
185
2
0
14 Sep 2024
Evaluating Pre-trained Convolutional Neural Networks and Foundation Models as Feature Extractors for Content-based Medical Image Retrieval
Amirreza Mahbod
Nematollah Saeidi
Sepideh Hatamikia
Ramona Woitek
VLM
MedIm
99
3
0
14 Sep 2024
GroundingBooth: Grounding Text-to-Image Customization
Zhexiao Xiong
Wei Xiong
Jing Shi
He Zhang
Yizhi Song
Nathan Jacobs
DiffM
128
9
0
13 Sep 2024
ProbTalk3D: Non-Deterministic Emotion Controllable Speech-Driven 3D Facial Animation Synthesis Using VQ-VAE
Sichun Wu
Kazi Injamamul Haque
Zerrin Yumak
VGen
79
2
0
12 Sep 2024
Relevance for Human Robot Collaboration
Xiaotong Zhang
Dingcheng Huang
Kamal Youcef-Toumi
111
2
0
12 Sep 2024
Foundation Models Boost Low-Level Perceptual Similarity Metrics
Abhijay Ghildyal
Nabajeet Barman
Saman Zadtootaghaj
90
4
0
11 Sep 2024
Towards Predicting Temporal Changes in a Patient's Chest X-ray Images based on Electronic Health Records
Daeun Kyung
J. Kim
Tackeun Kim
Edward Choi
MedIm
DiffM
91
1
0
11 Sep 2024
What to align in multimodal contrastive learning?
Benoit Dufumier
J. Castillo-Navarro
D. Tuia
Jean-Philippe Thiran
134
4
0
11 Sep 2024
Previous
1
2
3
...
22
23
24
...
33
34
35
Next