Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2103.00020
Cited By
Learning Transferable Visual Models From Natural Language Supervision
26 February 2021
Alec Radford
Jong Wook Kim
Chris Hallacy
Aditya A. Ramesh
Gabriel Goh
Sandhini Agarwal
Girish Sastry
Amanda Askell
Pamela Mishkin
Jack Clark
Gretchen Krueger
Ilya Sutskever
CLIP
VLM
Re-assign community
ArXiv (abs)
PDF
HTML
Github (29177★)
Papers citing
"Learning Transferable Visual Models From Natural Language Supervision"
50 / 1,722 papers shown
Title
LoVA: Long-form Video-to-Audio Generation
Xin Cheng
Xihua Wang
Yihan Wu
Yuyue Wang
Ruihua Song
VGen
DiffM
97
3
0
31 Dec 2024
VersaGen: Unleashing Versatile Visual Control for Text-to-Image Synthesis
Zhipeng Chen
Lan Yang
Yonggang Qi
Honggang Zhang
Kaiyue Pang
Ke Li
Yi-Zhe Song
DiffM
156
0
0
31 Dec 2024
Hierarchical Banzhaf Interaction for General Video-Language Representation Learning
Peng Jin
Haoyang Li
Li Yuan
Shuicheng Yan
Jie Chen
128
2
0
31 Dec 2024
Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
Hao Fei
Shengqiong Wu
Hao Zhang
Tat-Seng Chua
Shuicheng Yan
171
42
0
31 Dec 2024
Demystifying CLIP Data
Hu Xu
Saining Xie
Xiaoqing Ellen Tan
Po-Yao (Bernie) Huang
Russell Howes
Vasu Sharma
Shang-Wen Li
Gargi Ghosh
Luke Zettlemoyer
Christoph Feichtenhofer
VLM
CLIP
116
127
0
31 Dec 2024
A Comprehensive Survey of Large Language Models and Multimodal Large Language Models in Medicine
Hanguang Xiao
Feizhong Zhou
Xianglong Liu
Tianqi Liu
Zhipeng Li
Xin Liu
Xiaoxuan Huang
AILaw
LM&MA
LRM
118
29
0
31 Dec 2024
Grid Diffusion Models for Text-to-Video Generation
Taegyeong Lee
Soyeong Kwon
Taehwan Kim
125
8
0
31 Dec 2024
Fine-grained and Explainable Factuality Evaluation for Multimodal Summarization
Liqiang Jing
Jingxuan Zuo
Yue Zhang
109
8
0
31 Dec 2024
Enhancing Visual Representation for Text-based Person Searching
Wei Shen
Ming Fang
Yuxia Wang
Jiafeng Xiao
Diping Li
Ningyu Zhang
Ling Xu
Weinan Zhang
91
1
0
31 Dec 2024
Multimodal Fusion and Coherence Modeling for Video Topic Segmentation
Hai Yu
Chong Deng
Qinglin Zhang
Jiaqing Liu
Qian Chen
Wen Wang
164
0
0
31 Dec 2024
DriveEditor: A Unified 3D Information-Guided Framework for Controllable Object Editing in Driving Scenes
Yiyuan Liang
Zhiying Yan
Liqun Chen
Jiahuan Zhou
Luxin Yan
Sheng Zhong
Xu Zou
DiffM
VGen
108
1
0
31 Dec 2024
Combating Label Noise With A General Surrogate Model For Sample Selection
Chao Liang
Linchao Zhu
Humphrey Shi
Yi Yang
VLM
NoLa
99
2
0
31 Dec 2024
A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames
Pinelopi Papalampidi
Skanda Koppula
Shreya Pathak
Justin T Chiu
Joseph Heyward
Viorica Patraucean
Jiajun Shen
Antoine Miech
Andrew Zisserman
Aida Nematzdeh
VLM
120
26
0
31 Dec 2024
AdaDiff: Adaptive Step Selection for Fast Diffusion Models
Hui Zhang
Zuxuan Wu
Zhen Xing
Jie Shao
Yu-Gang Jiang
142
13
0
31 Dec 2024
Multi-Agent Planning Using Visual Language Models
Michele Brienza
F. Argenziano
Vincenzo Suriani
D. Bloisi
Daniele Nardi
LM&Ro
LLMAG
122
5
0
31 Dec 2024
M-Tuning: Prompt Tuning with Mitigated Label Bias in Open-Set Scenarios
Ning Liao
Xiaopeng Zhang
Minglu Cao
Junchi Yan
VPVLM
VLM
148
0
0
31 Dec 2024
Edicho: Consistent Image Editing in the Wild
Qingyan Bai
Hao Ouyang
Yinghao Xu
Qiuyu Wang
Ceyuan Yang
Ka Leong Cheng
Yujun Shen
Qifeng Chen
DiffM
127
1
0
30 Dec 2024
ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding
Xiao Wang
Qingyi Si
Jianlong Wu
Shiyu Zhu
Zheng Lin
Liqiang Nie
VLM
152
8
0
29 Dec 2024
Defending Multimodal Backdoored Models by Repulsive Visual Prompt Tuning
Zhifang Zhang
Shuo He
Bingquan Shen
Lei Feng
Lei Feng
AAML
121
1
0
29 Dec 2024
UniRestorer: Universal Image Restoration via Adaptively Estimating Image Degradation at Proper Granularity
Jingbo Lin
Zhilu Zhang
Wenbo Li
Renjing Pei
Hang Xu
Hongzhi Zhang
Wangmeng Zuo
112
1
0
28 Dec 2024
DreamFit: Garment-Centric Human Generation via a Lightweight Anything-Dressing Encoder
Ente Lin
Xujie Zhang
Fuwei Zhao
Yuxuan Luo
Xin Dong
Long Zeng
Xiaodan Liang
VLM
DiffM
114
2
0
23 Dec 2024
A Bias-Free Training Paradigm for More General AI-generated Image Detection
Fabrizio Guillaro
Giada Zingarini
Ben Usman
Avneesh Sud
D. Cozzolino
L. Verdoliva
DiffM
137
7
0
23 Dec 2024
Kernel-Aware Graph Prompt Learning for Few-Shot Anomaly Detection
Fenfang Tao
G. Xie
Fang Zhao
Xiangbo Shu
107
3
0
23 Dec 2024
Neural-MCRL: Neural Multimodal Contrastive Representation Learning for EEG-based Visual Decoding
Yueyang Li
Zijian Kang
Shengyu Gong
Wenhao Dong
Weiming Zeng
Hongjie Yan
W. Siok
Nizhuan Wang
106
2
0
23 Dec 2024
Self-Corrected Flow Distillation for Consistent One-Step and Few-Step Text-to-Image Generation
Quan Dao
Hao Phung
T. Dao
Dimitris Metaxas
Anh Tran
162
1
0
22 Dec 2024
MVREC: A General Few-shot Defect Classification Model Using Multi-View Region-Context
Shuai Lyu
Fangjian Liao
Zeqi Ma
Rongchen Zhang
Dongmei Mo
W. Wong
142
1
0
22 Dec 2024
Where am I? Cross-View Geo-localization with Natural Language Descriptions
Junyan Ye
Honglin Lin
Leyan Ou
Dairong Chen
Zihao Wang
Zeang Sheng
Weijia Li
Weijia Li
197
0
0
22 Dec 2024
GME: Improving Universal Multimodal Retrieval by Multimodal LLMs
Xin Zhang
Yanzhao Zhang
Wen Xie
Mingxin Li
Ziqi Dai
Dingkun Long
Pengjun Xie
Meishan Zhang
Wenjie Li
Hao Fei
202
20
0
22 Dec 2024
Adapting Image-to-Video Diffusion Models for Large-Motion Frame Interpolation
Luoxu Jin
Hiroshi Watanabe
DiffM
VGen
223
0
0
22 Dec 2024
Visual Prompting with Iterative Refinement for Design Critique Generation
Peitong Duan
Chin-Yi Cheng
Bjoern Hartmann
Yang Li
135
0
0
22 Dec 2024
HyperNet Fields: Efficiently Training Hypernetworks without Ground Truth by Learning Weight Trajectories
Eric Hedlin
Munawar Hayat
Fatih Porikli
Kwang Moo Yi
Shweta Mahajan
3DH
133
0
0
22 Dec 2024
UNEM: UNrolled Generalized EM for Transductive Few-Shot Learning
Long Zhou
Fereshteh Shakeri
Aymen Sadraoui
Mounir Kaaniche
J. Pesquet
Ismail Ben Ayed
VLM
168
0
0
21 Dec 2024
Towards scientific discovery with dictionary learning: Extracting biological concepts from microscopy foundation models
Konstantin Donhauser
Kristina Ulicna
Gemma Elyse Moran
Aditya Ravuri
Kian Kenyon-Dean
Cian Eastwood
Jason Hartford
145
0
0
20 Dec 2024
Defeasible Visual Entailment: Benchmark, Evaluator, and Reward-Driven Optimization
Yue Zhang
Liqiang Jing
Vibhav Gogate
173
4
0
19 Dec 2024
MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
Ho Kei Cheng
Masato Ishii
Akio Hayakawa
Takashi Shibuya
Alex Schwing
Yuki Mitsufuji
VGen
270
18
0
19 Dec 2024
FlexCache: Flexible Approximate Cache System for Video Diffusion
Desen Sun
Henry Tian
Tim Lu
Sihang Liu
DiffM
130
1
0
18 Dec 2024
RelationField: Relate Anything in Radiance Fields
Sebastian Koch
Johanna Wald
Mirco Colosi
Narunas Vaskevicius
Pedro Hermosilla
F. Tombari
Timo Ropinski
143
1
0
18 Dec 2024
Towards Automatic Evaluation for Image Transcreation
Simran Khanuja
Vivek Iyer
Claire He
Graham Neubig
ViT
114
2
0
18 Dec 2024
Adversarial Hubness in Multi-Modal Retrieval
Tingwei Zhang
Fnu Suya
Rishi Jha
Collin Zhang
Vitaly Shmatikov
AAML
142
1
0
18 Dec 2024
MMO-IG: Multi-Class and Multi-Scale Object Image Generation for Remote Sensing
Chuang Yang
Bingxuan Zhao
Qing Zhou
Qi Wang
144
3
0
18 Dec 2024
Seeking Consistent Flat Minima for Better Domain Generalization via Refining Loss Landscapes
Aodi Li
Liansheng Zhuang
Xiao Long
Minghong Yao
Shafei Wang
485
1
0
18 Dec 2024
Attentive Eraser: Unleashing Diffusion Model's Object Removal Potential via Self-Attention Redirection Guidance
Wenhao Sun
Benlei Cui
Xue-Mei Dong
Jingqun Tang
DiffM
184
14
0
17 Dec 2024
Beyond Accuracy: On the Effects of Fine-tuning Towards Vision-Language Model's Prediction Rationality
Qitong Wang
Tang Li
Kien X. Nguyen
Xi Peng
166
0
0
17 Dec 2024
CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models
Zihui Cheng
Qiguang Chen
Jin Zhang
Hao Fei
Xiaocheng Feng
Wanxiang Che
Min Li
L. Qin
VLM
MLLM
LRM
163
8
0
17 Dec 2024
GaussTR: Foundation Model-Aligned Gaussian Transformer for Self-Supervised 3D Spatial Understanding
Haoyi Jiang
Liu Liu
Tianheng Cheng
Xinjie Wang
Tianwei Lin
Zhizhong Su
Wen Liu
Xinyu Wang
3DGS
ViT
188
10
0
17 Dec 2024
Activating Distributed Visual Region within LLMs for Efficient and Effective Vision-Language Training and Inference
Siyuan Wang
Dianyi Wang
Chengxing Zhou
Zejun Li
Zhihao Fan
Xuanjing Huang
Zhongyu Wei
VLM
483
0
0
17 Dec 2024
GeoX: Geometric Problem Solving Through Unified Formalized Vision-Language Pre-training
Renqiu Xia
Mingxing Li
Hancheng Ye
Wenjie Wu
Hongbin Zhou
...
Zeang Sheng
Botian Shi
Tao Chen
Junchi Yan
Bo Zhang
156
10
0
16 Dec 2024
Multi-modal and Multi-scale Spatial Environment Understanding for Immersive Visual Text-to-Speech
Rui Liu
Shuwei He
Yifan Hu
Hong Li
VLM
140
3
0
16 Dec 2024
MOVIS: Enhancing Multi-Object Novel View Synthesis for Indoor Scenes
Ruijie Lu
Yixin Chen
Junfeng Ni
Baoxiong Jia
Yu Liu
Diwen Wan
Gang Zeng
Siyuan Huang
DiffM
197
4
0
16 Dec 2024
EditSplat: Multi-View Fusion and Attention-Guided Optimization for View-Consistent 3D Scene Editing with 3D Gaussian Splatting
Dong In Lee
Hyeongcheol Park
Jiyoung Seo
Eunbyung Park
Hyunje Park
Ha Dam Baek
Shin Sangheon
Sangmin kim
Sangpil Kim
3DGS
188
3
0
16 Dec 2024
Previous
1
2
3
...
15
16
17
...
33
34
35
Next