Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2103.00020
Cited By
Learning Transferable Visual Models From Natural Language Supervision
26 February 2021
Alec Radford
Jong Wook Kim
Chris Hallacy
Aditya A. Ramesh
Gabriel Goh
Sandhini Agarwal
Girish Sastry
Amanda Askell
Pamela Mishkin
Jack Clark
Gretchen Krueger
Ilya Sutskever
CLIP
VLM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Learning Transferable Visual Models From Natural Language Supervision"
50 / 9,779 papers shown
Title
VIDSTAMP: A Temporally-Aware Watermark for Ownership and Integrity in Video Diffusion Models
Mohammadreza Teymoorianfard
Shiqing Ma
Amir Houmansadr
WIGM
67
0
0
02 May 2025
Any-to-Any Vision-Language Model for Multimodal X-ray Imaging and Radiological Report Generation
Daniele Molino
Francesco Di Feola
Linlin Shen
Paolo Soda
V. Guarrasi
MedIm
LM&MA
67
0
0
02 May 2025
On the effectiveness of Large Language Models in the mechanical design domain
Daniele Grandi
Fabian Riquelme
24
0
0
02 May 2025
LightEMMA: Lightweight End-to-End Multimodal Model for Autonomous Driving
Zhijie Qiao
Haowei Li
Zhong Cao
Henry X. Liu
VLM
89
4
0
01 May 2025
T2VPhysBench: A First-Principles Benchmark for Physical Consistency in Text-to-Video Generation
Xuyang Guo
Jiayan Huo
Zhenmei Shi
Zhao-quan Song
Jiahao Zhang
Jiale Zhao
EGVM
VGen
PINN
82
1
0
01 May 2025
InstructAttribute: Fine-grained Object Attributes editing with Instruction
Xingxi Yin
Jingfeng Zhang
Zhi Li
Y. Li
Wenjie Qu
DiffM
165
0
0
01 May 2025
T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT
D. Jiang
Ziyu Guo
Renrui Zhang
Zhuofan Zong
Hao Li
Le Zhuo
Shilin Yan
Pheng-Ann Heng
Yiming Li
LRM
72
2
0
01 May 2025
SpatialLLM: A Compound 3D-Informed Design towards Spatially-Intelligent Large Multimodal Models
Wufei Ma
Luoxin Ye
Nessa McWeeney
Celso M de Melo
A. Yuille
Jieneng Chen
LRM
65
1
0
01 May 2025
Empowering Agentic Video Analytics Systems with Video Language Models
Yuxuan Yan
Shiqi Jiang
Ting Cao
Xuming Hu
Qianqian Yang
Yuanchao Shu
Yi Yang
Lili Qiu
VLM
70
0
0
01 May 2025
R&B: Domain Regrouping and Data Mixture Balancing for Efficient Foundation Model Training
Albert Ge
Tzu-Heng Huang
John Cooper
Avi Trost
Ziyi Chu
Satya Sai Srinath Namburi GNVV
Ziyang Cai
Kendall Park
Nicholas Roberts
Frederic Sala
53
0
0
01 May 2025
JointDiT: Enhancing RGB-Depth Joint Modeling with Diffusion Transformers
Kwon Byung-Ki
Qi Dai
Lee Hyoseok
Chong Luo
Tae-Hyun Oh
71
0
0
01 May 2025
AdCare-VLM: Leveraging Large Vision Language Model (LVLM) to Monitor Long-Term Medication Adherence and Care
Md Asaduzzaman Jabin
Hanqi Jiang
Y. Li
Patrick Kaggwa
Eugene Douglass
Juliet N. Sekandi
Tianming Liu
LM&MA
76
0
0
01 May 2025
A Survey of Robotic Navigation and Manipulation with Physics Simulators in the Era of Embodied AI
Lik Hang Kenny Wong
Xueyang Kang
Kaixin Bai
Jianwei Zhang
56
0
0
01 May 2025
Rethinking Memory in AI: Taxonomy, Operations, Topics, and Future Directions
Yiming Du
Wenyu Huang
Danna Zheng
Zhaowei Wang
Sébastien Montella
Mirella Lapata
Kam-Fai Wong
Jeff Z. Pan
KELM
MU
80
2
0
01 May 2025
Multi-Modal Language Models as Text-to-Image Model Evaluators
Jiahui Chen
Candace Ross
Reyhane Askari Hemmat
Koustuv Sinha
Melissa Hall
M. Drozdzal
Adriana Romero-Soriano
EGVM
60
0
0
01 May 2025
Cues3D: Unleashing the Power of Sole NeRF for Consistent and Unique Instances in Open-Vocabulary 3D Panoptic Segmentation
Feng Xue
Wenzhuang Xu
Guofeng Zhong
Anlong Minga
N. Sebe
65
0
0
01 May 2025
Controllable Weather Synthesis and Removal with Video Diffusion Models
Chih-Hao Lin
ziqi wang
Ruofan Liang
Yuxuan Zhang
Sanja Fidler
Shenlong Wang
Zan Gojcic
DiffM
VGen
48
0
0
01 May 2025
Handling Label Noise via Instance-Level Difficulty Modeling and Dynamic Optimization
Kuan Zhang
Chengliang Chai
Jingzhe Xu
Chi Zhang
Ye Yuan
Guoren Wang
Lei Cao
NoLa
66
0
0
01 May 2025
Investigating Zero-Shot Diagnostic Pathology in Vision-Language Models with Efficient Prompt Design
Vasudev Sharma
Ahmed Alagha
Abdelhakim Khellaf
Vincent Quoc-Huy Trinh
Mahdi S. Hosseini
38
0
0
30 Apr 2025
MoSAM: Motion-Guided Segment Anything Model with Spatial-Temporal Memory Selection
Q. Yang
Yuan Yao
Miaomiao Cui
Liefeng Bo
VLM
61
0
0
30 Apr 2025
VividListener: Expressive and Controllable Listener Dynamics Modeling for Multi-Modal Responsive Interaction
Shiying Li
Xingqun Qi
Bingkun Yang
Chen Weile
Zezhao Tian
Muyi Sun
Qifeng Liu
Man Zhang
Zhenan Sun
64
0
0
30 Apr 2025
Online Federation For Mixtures of Proprietary Agents with Black-Box Encoders
Xuwei Yang
Fatemeh Tavakoli
D. B. Emerson
Anastasis Kratsios
FedML
62
0
0
30 Apr 2025
ReVision: High-Quality, Low-Cost Video Generation with Explicit 3D Physics Modeling for Complex Motion and Interaction
Qihao Liu
Ju He
Qihang Yu
Liang-Chieh Chen
Alan Yuille
DiffM
VGen
83
0
0
30 Apr 2025
The Dual Power of Interpretable Token Embeddings: Jailbreaking Attacks and Defenses for Diffusion Model Unlearning
Siyi Chen
Yimeng Zhang
Sijia Liu
Q. Qu
AAML
147
0
0
30 Apr 2025
Adapting In-Domain Few-Shot Segmentation to New Domains without Retraining
Qi Fan
Kaiqi Liu
Nian Liu
Hisham Cholakkal
Rao Muhammad Anwer
Wenbin Li
Yang Gao
77
0
0
30 Apr 2025
An Evaluation of a Visual Question Answering Strategy for Zero-shot Facial Expression Recognition in Still Images
Modesto Castrillón-Santana
Oliverio J. Santana
David Freire-Obregón
Daniel Hernández-Sosa
J. Lorenzo-Navarro
54
0
0
30 Apr 2025
Calibrating Uncertainty Quantification of Multi-Modal LLMs using Grounding
Trilok Padhi
R. Kaur
Adam D. Cobb
Manoj Acharya
Anirban Roy
Colin Samplawski
Brian Matejek
Alexander M. Berenbeim
Nathaniel D. Bastian
Susmit Jha
28
0
0
30 Apr 2025
Vision-Language Model-Based Semantic-Guided Imaging Biomarker for Early Lung Cancer Detection
Luoting Zhuang
Seyed Mohammad Hossein Tabatabaei
Ramin Salehi-Rad
Linh M. Tran
Denise R. Aberle
Ashley E. Prosper
William Hsu
36
0
0
30 Apr 2025
MagicPortrait: Temporally Consistent Face Reenactment with 3D Geometric Guidance
Mengting Wei
Yante Li
Tuomas Varanka
Yan Jiang
Guoying Zhao
DiffM
VGen
74
0
0
30 Apr 2025
Why Compress What You Can Generate? When GPT-4o Generation Ushers in Image Compression Fields
Yixin Gao
Xiaohan Pan
X. Li
Zhibo Chen
51
0
0
30 Apr 2025
GDI-Bench: A Benchmark for General Document Intelligence with Vision and Reasoning Decoupling
Siqi Li
Yufan Shen
Xiangnan Chen
Jiayi Chen
Hengwei Ju
...
Licheng Wen
Botian Shi
Y. Liu
Xinyu Cai
Yu Qiao
VLM
ELM
91
0
0
30 Apr 2025
Investigating the Effect of Parallel Data in the Cross-Lingual Transfer for Vision-Language Encoders
Andrei-Alexandru Manea
Jindřich Libovický
VLM
52
0
0
30 Apr 2025
AGHI-QA: A Subjective-Aligned Dataset and Metric for AI-Generated Human Images
Yunhao Li
Sijing Wu
Wei Sun
Zhichao Zhang
Yucheng Zhu
Zicheng Zhang
Huiyu Duan
Xiongkuo Min
Guangtao Zhai
EGVM
93
0
0
30 Apr 2025
RoboGround: Robotic Manipulation with Grounded Vision-Language Priors
Haifeng Huang
Xinyi Chen
Yuhang Chen
Yiming Li
Xiaoshen Han
Zihao Wang
Tai Wang
Jiangmiao Pang
Zhou Zhao
LM&Ro
80
0
0
30 Apr 2025
XeMap: Contextual Referring in Large-Scale Remote Sensing Environments
Yicong Li
Lu Si
Y. T. Hou
Chengaung Liu
Yangqiu Song
Hongjian Fang
Jianwei Zhang
79
0
0
30 Apr 2025
Wireless Communication as an Information Sensor for Multi-agent Cooperative Perception: A Survey
Zhiying Song
Tenghui Xie
Fuxi Wen
Jun Li
44
0
0
30 Apr 2025
Detecting and Mitigating Hateful Content in Multimodal Memes with Vision-Language Models
Minh-Hao Van
Xintao Wu
VLM
88
0
0
30 Apr 2025
Mcity Data Engine: Iterative Model Improvement Through Open-Vocabulary Data Selection
Daniel Bogdoll
Rajanikant Ananta
Abeyankar Giridharan
Isabel Moore
Gregory Stevens
Henry X. Liu
VLM
51
0
0
30 Apr 2025
Diff-Prompt: Diffusion-Driven Prompt Generator with Mask Supervision
Weicai Yan
Wang Lin
Zirun Guo
Ye Wang
Fangming Feng
Xiaoda Yang
ziqi wang
Tao Jin
DiffM
132
2
0
30 Apr 2025
Rethinking Visual Layer Selection in Multimodal LLMs
H. Chen
Junyan Lin
Xinhao Chen
Yue Fan
Xin Jin
Hui Su
Jianfeng Dong
Jinlan Fu
Xiaoyu Shen
VLM
95
0
0
30 Apr 2025
LangWBC: Language-directed Humanoid Whole-Body Control via End-to-end Learning
Yiyang Shao
Xiaoyu Huang
Bike Zhang
Qiayuan Liao
Yuman Gao
Yufeng Chi
Zhongyu Li
Sophia Shao
K. Sreenath
LM&Ro
157
0
0
30 Apr 2025
Direct Motion Models for Assessing Generated Videos
Kelsey R. Allen
Carl Doersch
Guangyao Zhou
Mohammed Suhail
Danny Driess
...
Thomas Kipf
Mehdi S. M. Sajjadi
Kevin P. Murphy
João Carreira
Sjoerd van Steenkiste
EGVM
DiffM
VGen
78
0
0
30 Apr 2025
Synergy-CLIP: Extending CLIP with Multi-modal Integration for Robust Representation Learning
Sangyeon Cho
Jangyeong Jeon
Mingi Kim
Junyeong Kim
CLIP
VLM
76
0
0
30 Apr 2025
OpenAVS: Training-Free Open-Vocabulary Audio Visual Segmentation with Foundational Models
Shengkai Chen
Yifang Yin
Jinming Cao
Shili Xiang
Zhenguang Liu
Roger Zimmermann
VOS
VLM
48
0
0
30 Apr 2025
CoCoDiff: Diversifying Skeleton Action Features via Coarse-Fine Text-Co-Guided Latent Diffusion
Zhifu Zhao
Hanyang Hua
Jiajian Li
Shaoxin Wu
Fu Li
Yangtao Zhou
Yang Li
DiffM
68
0
0
30 Apr 2025
GarmentDiffusion: 3D Garment Sewing Pattern Generation with Multimodal Diffusion Transformers
Xinyu Li
Qi Yao
Yalin Wang
DiffM
48
0
0
30 Apr 2025
Black-Box Visual Prompt Engineering for Mitigating Object Hallucination in Large Vision Language Models
Sangmin Woo
Kang Zhou
Yun Zhou
Shuai Wang
Sheng Guan
Haibo Ding
Lin Lee Cheong
VPVLM
83
0
0
30 Apr 2025
Visual Text Processing: A Comprehensive Review and Unified Evaluation
Yan Shu
Weichao Zeng
Fangmin Zhao
Zeyu Chen
Zeju Li
...
Paolo Rota
Xiang Bai
Lianwen Jin
Xu-Cheng Yin
N. Sebe
CoGe
61
0
0
30 Apr 2025
SeriesBench: A Benchmark for Narrative-Driven Drama Series Understanding
Chenkai Zhang
Yiming Lei
Ziqiang Liu
Haitao Leng
Shaoguo Liu
Tingting Gao
Qingjie Liu
Yunhong Wang
AI4TS
56
0
0
30 Apr 2025
T2ID-CAS: Diffusion Model and Class Aware Sampling to Mitigate Class Imbalance in Neck Ultrasound Anatomical Landmark Detection
Manikanta Varaganti
Amulya Vankayalapati
Nour Awad
Gregory R. Dion
Laura J. Brattain
DiffM
MedIm
67
0
0
29 Apr 2025
Previous
1
2
3
...
6
7
8
...
194
195
196
Next