Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2001.07966
Cited By
ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data
22 January 2020
Di Qi
Lin Su
Jianwei Song
Edward Cui
Taroon Bharti
Arun Sacheti
VLM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data"
50 / 53 papers shown
Title
UniGS: Unified Language-Image-3D Pretraining with Gaussian Splatting
Haoyuan Li
Yanpeng Zhou
Tao Tang
Jifei Song
Yihan Zeng
Michael C. Kampffmeyer
Hang Xu
Xiaodan Liang
3DGS
67
1
0
25 Feb 2025
CLIP-PING: Boosting Lightweight Vision-Language Models with Proximus Intrinsic Neighbors Guidance
Chu Myaet Thwal
Ye Lin Tun
Minh N. H. Nguyen
Eui-nam Huh
Choong Seon Hong
VLM
74
0
0
05 Dec 2024
Mobile Edge Intelligence for Large Language Models: A Contemporary Survey
Guanqiao Qu
Qiyuan Chen
Wei Wei
Zheng Lin
Xianhao Chen
Kaibin Huang
42
43
0
09 Jul 2024
From Text to Transformation: A Comprehensive Review of Large Language Models' Versatility
Pravneet Kaur
Gautam Siddharth Kashyap
Ankit Kumar
Md. Tabrez Nafis
Sandeep Kumar
Vikrant Shokeen
LM&MA
48
54
0
25 Feb 2024
GPT-4 Enhanced Multimodal Grounding for Autonomous Driving: Leveraging Cross-Modal Attention with Large Language Models
Haicheng Liao
Huanming Shen
Zhenning Li
Chengyue Wang
Guofa Li
Yiming Bie
Chengzhong Xu
34
50
0
06 Dec 2023
Grid Jigsaw Representation with CLIP: A New Perspective on Image Clustering
Zijie Song
Zhenzhen Hu
Richang Hong
SSL
46
0
0
27 Oct 2023
SNeL: A Structured Neuro-Symbolic Language for Entity-Based Multimodal Scene Understanding
Silvan Ferreira
Allan Martins
Ivanovitch Silva
30
0
0
09 Jun 2023
CAVL: Learning Contrastive and Adaptive Representations of Vision and Language
Shentong Mo
Jingfei Xia
Ihor Markevych
CLIP
VLM
16
1
0
10 Apr 2023
Probabilistic Prompt Learning for Dense Prediction
Hyeongjun Kwon
Taeyong Song
Somi Jeong
Jin-Hwa Kim
Jinhyun Jang
Kwanghoon Sohn
VLM
25
18
0
03 Apr 2023
Transformers in Speech Processing: A Survey
S. Latif
Aun Zaidi
Heriberto Cuayáhuitl
Fahad Shamshad
Moazzam Shoukat
Junaid Qadir
42
47
0
21 Mar 2023
Few-shot Multimodal Multitask Multilingual Learning
Aman Chadha
Vinija Jain
47
0
0
19 Feb 2023
End-to-end Semantic Object Detection with Cross-Modal Alignment
Silvan Ferreira
Allan Martins
Ivan S. S. Silva
ObjD
19
0
0
10 Feb 2023
Controlling for Stereotypes in Multimodal Language Model Evaluation
Manuj Malik
Richard Johansson
20
1
0
03 Feb 2023
Effective End-to-End Vision Language Pretraining with Semantic Visual Loss
Xiaofeng Yang
Fayao Liu
Guosheng Lin
VLM
24
7
0
18 Jan 2023
Scene-centric vs. Object-centric Image-Text Cross-modal Retrieval: A Reproducibility Study
Mariya Hendriksen
Svitlana Vakulenko
E. Kuiper
Maarten de Rijke
31
5
0
12 Jan 2023
Position-guided Text Prompt for Vision-Language Pre-training
Alex Jinpeng Wang
Pan Zhou
Mike Zheng Shou
Shuicheng Yan
VLM
24
37
0
19 Dec 2022
TencentPretrain: A Scalable and Flexible Toolkit for Pre-training Models of Different Modalities
Zhe Zhao
Yudong Li
Cheng-An Hou
Jing-xin Zhao
Rong Tian
...
Xingwu Sun
Zhanhui Kang
Xiaoyong Du
Linlin Shen
Kimmo Yan
VLM
41
23
0
13 Dec 2022
Masked Vision-Language Transformer in Fashion
Ge-Peng Ji
Mingchen Zhuge
D. Gao
Deng-Ping Fan
Christos Sakaridis
Luc Van Gool
21
25
0
27 Oct 2022
Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding
Fengyuan Shi
Ruopeng Gao
Weilin Huang
Limin Wang
27
23
0
28 Sep 2022
MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining
Xiaoyi Dong
Jianmin Bao
Yinglin Zheng
Ting Zhang
Dongdong Chen
...
Weiming Zhang
Lu Yuan
Dong Chen
Fang Wen
Nenghai Yu
CLIP
VLM
48
157
0
25 Aug 2022
Semantic-Enhanced Image Clustering
Shao-Qian Cai
Li-qing Qiu
Xiaojun Chen
Qin Zhang
Long Chen
VLM
29
13
0
21 Aug 2022
Masked Vision and Language Modeling for Multi-modal Representation Learning
Gukyeong Kwon
Zhaowei Cai
Avinash Ravichandran
Erhan Bas
Rahul Bhotika
Stefano Soatto
33
67
0
03 Aug 2022
Augmenting Vision Language Pretraining by Learning Codebook with Visual Semantics
Xiaoyuan Guo
Jiali Duan
C.-C. Jay Kuo
J. Gichoya
Imon Banerjee
VLM
19
1
0
31 Jul 2022
ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval
Nicola Messina
Matteo Stefanini
Marcella Cornia
Lorenzo Baraldi
Fabrizio Falchi
Giuseppe Amato
Rita Cucchiara
VLM
16
21
0
29 Jul 2022
Mimetic Models: Ethical Implications of AI that Acts Like You
Reid McIlroy-Young
Jon M. Kleinberg
S. Sen
Solon Barocas
Ashton Anderson
13
16
0
19 Jul 2022
Multimodal Learning with Transformers: A Survey
P. Xu
Xiatian Zhu
David A. Clifton
ViT
54
527
0
13 Jun 2022
TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving
Kashyap Chitta
Aditya Prakash
Bernhard Jaeger
Zehao Yu
Katrin Renz
Andreas Geiger
ViT
104
295
0
31 May 2022
Training and challenging models for text-guided fashion image retrieval
Eric Dodds
Jack Culpepper
Gaurav Srivastava
18
8
0
23 Apr 2022
ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval
Mengjun Cheng
Yipeng Sun
Long Wang
Xiongwei Zhu
Kun Yao
...
Guoli Song
Junyu Han
Jingtuo Liu
Errui Ding
Jingdong Wang
24
60
0
31 Mar 2022
Large-scale Bilingual Language-Image Contrastive Learning
ByungSoo Ko
Geonmo Gu
VLM
29
14
0
28 Mar 2022
Multi-modal Alignment using Representation Codebook
Jiali Duan
Liqun Chen
Son Tran
Jinyu Yang
Yi Xu
Belinda Zeng
Trishul Chilimbi
28
67
0
28 Feb 2022
Vision-Language Pre-Training with Triple Contrastive Learning
Jinyu Yang
Jiali Duan
Son N. Tran
Yi Xu
Sampath Chanda
Liqun Chen
Belinda Zeng
Trishul Chilimbi
Junzhou Huang
VLM
29
289
0
21 Feb 2022
VLP: A Survey on Vision-Language Pre-training
Feilong Chen
Duzhen Zhang
Minglun Han
Xiuyi Chen
Jing Shi
Shuang Xu
Bo Xu
VLM
82
213
0
18 Feb 2022
Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks
Xizhou Zhu
Jinguo Zhu
Hao Li
Xiaoshi Wu
Xiaogang Wang
Hongsheng Li
Xiaohua Wang
Jifeng Dai
50
129
0
02 Dec 2021
Florence: A New Foundation Model for Computer Vision
Lu Yuan
Dongdong Chen
Yi-Ling Chen
Noel Codella
Xiyang Dai
...
Zhen Xiao
Jianwei Yang
Michael Zeng
Luowei Zhou
Pengchuan Zhang
VLM
29
879
0
22 Nov 2021
FILIP: Fine-grained Interactive Language-Image Pre-Training
Lewei Yao
Runhu Huang
Lu Hou
Guansong Lu
Minzhe Niu
Hang Xu
Xiaodan Liang
Zhenguo Li
Xin Jiang
Chunjing Xu
VLM
CLIP
30
614
0
09 Nov 2021
M5Product: Self-harmonized Contrastive Learning for E-commercial Multi-modal Pretraining
Xiao Dong
Xunlin Zhan
Yangxin Wu
Yunchao Wei
Michael C. Kampffmeyer
Xiaoyong Wei
Minlong Lu
Yaowei Wang
Xiaodan Liang
30
36
0
09 Sep 2021
Video Transformer for Deepfake Detection with Incremental Learning
Sohail Ahmed Khan
Hang Dai
ViT
21
62
0
11 Aug 2021
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
Junnan Li
Ramprasaath R. Selvaraju
Akhilesh Deepak Gotmare
Shafiq R. Joty
Caiming Xiong
S. Hoi
FaML
56
1,886
0
16 Jul 2021
Pre-Trained Models: Past, Present and Future
Xu Han
Zhengyan Zhang
Ning Ding
Yuxian Gu
Xiao Liu
...
Jie Tang
Ji-Rong Wen
Jinhui Yuan
Wayne Xin Zhao
Jun Zhu
AIFin
MQ
AI4MH
37
815
0
14 Jun 2021
M6-UFC: Unifying Multi-Modal Controls for Conditional Image Synthesis via Non-Autoregressive Generative Transformers
Zhu Zhang
Jianxin Ma
Chang Zhou
Rui Men
Zhikang Li
Ming Ding
Jie Tang
Jingren Zhou
Hongxia Yang
25
46
0
29 May 2021
Recent Advances in Deep Learning Based Dialogue Systems: A Systematic Survey
Jinjie Ni
Tom Young
Vlad Pandelea
Fuzhao Xue
Erik Cambria
54
268
0
10 May 2021
Multi-Modal Fusion Transformer for End-to-End Autonomous Driving
Aditya Prakash
Kashyap Chitta
Andreas Geiger
ViT
33
509
0
19 Apr 2021
WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning
Krishna Srinivasan
K. Raman
Jiecao Chen
Michael Bendersky
Marc Najork
VLM
208
310
0
02 Mar 2021
Learning Transferable Visual Models From Natural Language Supervision
Alec Radford
Jong Wook Kim
Chris Hallacy
Aditya A. Ramesh
Gabriel Goh
...
Amanda Askell
Pamela Mishkin
Jack Clark
Gretchen Krueger
Ilya Sutskever
CLIP
VLM
114
27,682
0
26 Feb 2021
Predicting times of waiting on red signals using BERT
Witold Szejgis
Anna Warno
P. Góra
21
1
0
20 Feb 2021
Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs
Emanuele Bugliarello
Ryan Cotterell
Naoaki Okazaki
Desmond Elliott
29
119
0
30 Nov 2020
COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning
Simon Ging
Mohammadreza Zolfaghari
Hamed Pirsiavash
Thomas Brox
ViT
CLIP
13
168
0
01 Nov 2020
X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers
Jaemin Cho
Jiasen Lu
Dustin Schwenk
Hannaneh Hajishirzi
Aniruddha Kembhavi
VLM
MLLM
19
102
0
23 Sep 2020
ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph
Fei Yu
Jiji Tang
Weichong Yin
Yu Sun
Hao Tian
Hua-Hong Wu
Haifeng Wang
13
375
0
30 Jun 2020
1
2
Next