arXiv:2102.08981 (v2, latest)
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
17 February 2021
Soravit Changpinyo
P. Sharma
Nan Ding
Radu Soricut
VLM
Papers citing "Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts"
50 / 871 papers shown
SmallCap: Lightweight Image Captioning Prompted with Retrieval Augmentation
R. Ramos
Bruno Martins
Desmond Elliott
Yova Kementchedjhieva
VLM
89
89
0
30 Sep 2022
ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training
Bin Shan
Weichong Yin
Yu Sun
Hao Tian
Hua Wu
Haifeng Wang
VLM
75
19
0
30 Sep 2022
FreeSeg: Free Mask from Interpretable Contrastive Language-Image Pretraining for Semantic Segmentation
Yi Li
Huifeng Yao
Hualiang Wang
Xuelong Li
ISeg
VLM
93
3
0
27 Sep 2022
UniCLIP: Unified Framework for Contrastive Language-Image Pre-training
Janghyeon Lee
Jongsuk Kim
Hyounguk Shon
Bumsoo Kim
Seung Wook Kim
Honglak Lee
Junmo Kim
CLIP
VLM
146
58
0
27 Sep 2022
Paraphrasing Is All You Need for Novel Object Captioning
Cheng Yang
Yao-Hung Hubert Tsai
Wanshu Fan
Ruslan Salakhutdinov
Louis-Philippe Morency
Yu-Chiang Frank Wang
80
4
0
25 Sep 2022
LGDN: Language-Guided Denoising Network for Video-Language Modeling
Haoyu Lu
Mingyu Ding
Nanyi Fei
Yuqi Huo
Zhiwu Lu
VLM
148
16
0
23 Sep 2022
Implementing and Experimenting with Diffusion Models for Text-to-Image Generation
Robin Zbinden
42
3
0
22 Sep 2022
LAVIS: A Library for Language-Vision Intelligence
Dongxu Li
Junnan Li
Hung Le
Guangsen Wang
Silvio Savarese
Steven C.H. Hoi
VLM
192
56
0
15 Sep 2022
OmniVL: One Foundation Model for Image-Language and Video-Language Tasks
Junke Wang
Dongdong Chen
Zuxuan Wu
Chong Luo
Luowei Zhou
Yucheng Zhao
Yujia Xie
Ce Liu
Yu-Gang Jiang
Lu Yuan
MLLM
VLM
136
153
0
15 Sep 2022
Exploring Visual Interpretability for Contrastive Language-Image Pre-training
Yi Li
Hualiang Wang
Yiqun Duan
Han Xu
Xiaomeng Li
CLIP
VLM
153
27
0
15 Sep 2022
Correlation Information Bottleneck: Towards Adapting Pretrained Multimodal Models for Robust Visual Question Answering
Jingjing Jiang
Zi-yi Liu
Nanning Zheng
89
8
0
14 Sep 2022
PaLI: A Jointly-Scaled Multilingual Language-Image Model
Xi Chen
Tianlin Li
Soravit Changpinyo
A. Piergiovanni
Piotr Padlewski
...
Andreas Steiner
A. Angelova
Xiaohua Zhai
N. Houlsby
Radu Soricut
MLLM
VLM
205
741
0
14 Sep 2022
CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment
Hongwei Xue
Yuchong Sun
Bei Liu
Jianlong Fu
Rui Song
Houqiang Li
Jiebo Luo
CLIP
VLM
120
70
0
14 Sep 2022
StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation
A. Maharana
Darryl Hannan
Mohit Bansal
DiffM
112
83
0
13 Sep 2022
PreSTU: Pre-Training for Scene-Text Understanding
Jihyung Kil
Soravit Changpinyo
Xi Chen
Hexiang Hu
Sebastian Goodman
Wei-Lun Chao
Radu Soricut
VLM
191
29
0
12 Sep 2022
MaXM: Towards Multilingual Visual Question Answering
Soravit Changpinyo
Linting Xue
Michal Yarom
Ashish V. Thapliyal
Idan Szpektor
J. Amelot
Xi Chen
Radu Soricut
108
8
0
12 Sep 2022
OmDet: Large-scale vision-language multi-dataset pre-training with multimodal detection network
Tiancheng Zhao
Peng Liu
Kyusong Lee
VLM
MLLM
ObjD
42
5
0
10 Sep 2022
Design of the topology for contrastive visual-textual alignment
Zhun Sun
95
1
0
05 Sep 2022
IMG2IMU: Translating Knowledge from Large-Scale Images to IMU Sensing Applications
Hyungjun Yoon
Hyeong-Tae Cha
Hoang C. Nguyen
Taesik Gong
Sungyeop Lee
VLM
SSL
106
1
0
02 Sep 2022
Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment
Mustafa Shukor
Guillaume Couairon
Matthieu Cord
VLM
CLIP
100
27
0
29 Aug 2022
Multimedia Generative Script Learning for Task Planning
Qingyun Wang
Manling Li
Hou Pong Chan
Lifu Huang
Julia Hockenmaier
Girish Chowdhary
Heng Ji
VGen
117
13
0
25 Aug 2022
MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining
Xiaoyi Dong
Jianmin Bao
Yinglin Zheng
Ting Zhang
Dongdong Chen
...
Weiming Zhang
Lu Yuan
Dong Chen
Fang Wen
Nenghai Yu
CLIP
VLM
113
167
0
25 Aug 2022
MuMUR : Multilingual Multimodal Universal Retrieval
Avinash Madasu
Estelle Aflalo
Gabriela Ben-Melech Stan
Shachar Rosenman
Shao-Yen Tseng
Gedas Bertasius
Vasudev Lal
117
3
0
24 Aug 2022
Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks
Wenhui Wang
Hangbo Bao
Li Dong
Johan Bjorck
Zhiliang Peng
...
Kriti Aggarwal
O. Mohammed
Saksham Singhal
Subhojit Som
Furu Wei
MLLM
VLM
ViT
157
645
0
22 Aug 2022
Multimodal foundation models are better simulators of the human brain
Haoyu Lu
Qiongyi Zhou
Nanyi Fei
Zhiwu Lu
Mingyu Ding
...
Changde Du
Xin Zhao
Haoran Sun
Huiguang He
J. Wen
AI4CE
85
13
0
17 Aug 2022
Deception for Cyber Defence: Challenges and Opportunities
David Liebowitz
Surya Nepal
Kristen Moore
Cody James Christopher
S. Kanhere
David D. Nguyen
Roelien C. Timmer
Michael Longland
Keerth Rathakumar
70
10
0
15 Aug 2022
Exploiting Multiple Sequence Lengths in Fast End to End Training for Image Captioning
J. Hu
Roberto Cavicchioli
Alessandro Capotondi
128
22
0
13 Aug 2022
Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP
Thao Nguyen
Gabriel Ilharco
Mitchell Wortsman
Sewoong Oh
Ludwig Schmidt
CLIP
VLM
180
108
0
10 Aug 2022
Visual Recognition by Request
Chufeng Tang
Lingxi Xie
Xiaopeng Zhang
Xiaolin Hu
Qi Tian
VLM
93
15
0
28 Jul 2022
Group DETR: Fast DETR Training with Group-Wise One-to-Many Assignment
Qiang Chen
Xiaokang Chen
Jian Wang
Shan Zhang
Kun Yao
Haocheng Feng
Junyu Han
Errui Ding
Gang Zeng
Jingdong Wang
ViT
143
135
0
26 Jul 2022
Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding
Quan Liu
Youpeng Wen
Jianhua Han
Chunjing Xu
Hang Xu
Xiaodan Liang
VLM
151
70
0
18 Jul 2022
LaT: Latent Translation with Cycle-Consistency for Video-Text Retrieval
Jinbin Bai
Chunhui Liu
Feiyue Ni
Haofan Wang
Mengying Hu
Xiaofeng Guo
Lele Cheng
102
11
0
11 Jul 2022
American == White in Multimodal Language-and-Image AI
Robert Wolfe
Aylin Caliskan
VLM
85
51
0
01 Jul 2022
VL-CheckList: Evaluating Pre-trained Vision-Language Models with Objects, Attributes and Relations
Tiancheng Zhao
Tianqi Zhang
Mingwei Zhu
Haozhan Shen
Kyusong Lee
Xiaopeng Lu
Jianwei Yin
VLM
CoGe
MLLM
116
99
0
01 Jul 2022
Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations
Ziyan Yang
Kushal Kafle
Franck Dernoncourt
Vicente Ordónez Román
VLM
90
25
0
30 Jun 2022
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
Jiahui Yu
Yuanzhong Xu
Jing Yu Koh
Thang Luong
Gunjan Baid
...
Zarana Parekh
Xin Li
Han Zhang
Jason Baldridge
Yonghui Wu
EGVM
223
1,134
0
22 Jun 2022
Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks
Jiasen Lu
Christopher Clark
Rowan Zellers
Roozbeh Mottaghi
Aniruddha Kembhavi
ObjD
VLM
MLLM
171
412
0
17 Jun 2022
Self-Supervised Contrastive Pre-Training For Time Series via Time-Frequency Consistency
Xiang Zhang
Ziyuan Zhao
Theodoros Tsiligkaridis
Marinka Zitnik
AI4TS
139
293
0
17 Jun 2022
MixGen: A New Multi-Modal Data Augmentation
Xiaoshuai Hao
Yi Zhu
Srikar Appalaraju
Aston Zhang
Wanqian Zhang
Boyang Li
Mu Li
VLM
113
90
0
16 Jun 2022
LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling
Linjie Li
Zhe Gan
Kevin Qinghong Lin
Chung-Ching Lin
Zicheng Liu
Ce Liu
Lijuan Wang
MLLM
VLM
90
84
0
14 Jun 2022
ProcTHOR: Large-Scale Embodied AI Using Procedural Generation
Matt Deitke
Eli VanderBilt
Alvaro Herrasti
Luca Weihs
Jordi Salvador
...
Winson Han
Eric Kolve
Ali Farhadi
Aniruddha Kembhavi
Roozbeh Mottaghi
LM&Ro
124
265
0
14 Jun 2022
Multimodal Learning with Transformers: A Survey
Peng Xu
Xiatian Zhu
David Clifton
ViT
236
575
0
13 Jun 2022
Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs
Jinguo Zhu
Xizhou Zhu
Wenhai Wang
Xiaohua Wang
Hongsheng Li
Xiaogang Wang
Jifeng Dai
MoMe
MoE
96
70
0
09 Jun 2022
Revealing Single Frame Bias for Video-and-Language Learning
Jie Lei
Tamara L. Berg
Mohit Bansal
96
115
0
07 Jun 2022
Delving into the Openness of CLIP
Shuhuai Ren
Lei Li
Xuancheng Ren
Guangxiang Zhao
Xu Sun
VLM
92
13
0
04 Jun 2022
Prefix Conditioning Unifies Language and Label Supervision
Kuniaki Saito
Kihyuk Sohn
Xinming Zhang
Chun-Liang Li
Chen-Yu Lee
Kate Saenko
Tomas Pfister
VLM
CLIP
99
16
0
02 Jun 2022
Improved Vector Quantized Diffusion Models
Zhicong Tang
Shuyang Gu
Jianmin Bao
Dong Chen
Fang Wen
DiffM
246
63
0
31 May 2022
VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models
Wangchunshu Zhou
Yan Zeng
Shizhe Diao
Xinsong Zhang
CoGe
VLM
97
13
0
30 May 2022
CyCLIP: Cyclic Contrastive Language-Image Pretraining
Shashank Goel
Hritik Bansal
S. Bhatia
Ryan Rossi
Vishwa Vinay
Aditya Grover
CLIP
VLM
280
140
0
28 May 2022
Multimodal Masked Autoencoders Learn Transferable Representations
Xinyang Geng
Hao Liu
Lisa Lee
Dale Schuurmans
Sergey Levine
Pieter Abbeel
93
119
0
27 May 2022