Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2004.00849
Cited By
Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers
2 April 2020
Zhicheng Huang
Zhaoyang Zeng
Bei Liu
Dongmei Fu
Jianlong Fu
ViT
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers"
50 / 287 papers shown
Title
High-Performance Transformers for Table Structure Recognition Need Early Convolutions
Sheng-Hsuan Peng
Seongmin Lee
Xiaojing Wang
Rajarajeswari Balasubramaniyan
Duen Horng Chau
ViT
LMTD
24
3
0
09 Nov 2023
From Image to Language: A Critical Analysis of Visual Question Answering (VQA) Approaches, Challenges, and Opportunities
Md Farhan Ishmam
Md Sakib Hossain Shovon
M. F. Mridha
Nilanjan Dey
43
36
0
01 Nov 2023
Grid Jigsaw Representation with CLIP: A New Perspective on Image Clustering
Zijie Song
Zhenzhen Hu
Richang Hong
SSL
46
0
0
27 Oct 2023
Multiscale Superpixel Structured Difference Graph Convolutional Network for VL Representation
Siyu Zhang
Ye-Ting Chen
Fang Wang
Yaoru Sun
Jun Yang
Lizhi Bai
SSL
30
0
0
20 Oct 2023
RSAdapter: Adapting Multimodal Models for Remote Sensing Visual Question Answering
Yuduo Wang
Pedram Ghamisi
30
4
0
19 Oct 2023
VLATTACK: Multimodal Adversarial Attacks on Vision-Language Tasks via Pre-trained Models
Ziyi Yin
Muchao Ye
Tianrong Zhang
Tianyu Du
Jinguo Zhu
Han Liu
Jinghui Chen
Ting Wang
Fenglong Ma
AAML
VLM
CoGe
33
36
0
07 Oct 2023
Expedited Training of Visual Conditioned Language Generation via Redundancy Reduction
Yiren Jian
Tingkai Liu
Yunzhe Tao
Chunhui Zhang
Soroush Vosoughi
HX Yang
VLM
20
7
0
05 Oct 2023
Object-Centric Open-Vocabulary Image-Retrieval with Aggregated Features
Hila Levi
Guy Heller
Dan Levi
Ethan Fetaya
OCL
VLM
27
3
0
26 Sep 2023
A Survey on Image-text Multimodal Models
Ruifeng Guo
Jingxuan Wei
Linzhuang Sun
Khai Le-Duc
Guiyong Chang
Dawei Liu
Sibo Zhang
Zhengbing Yao
Mingjun Xu
Liping Bu
VLM
31
5
0
23 Sep 2023
Synthetic Boost: Leveraging Synthetic Data for Enhanced Vision-Language Segmentation in Echocardiography
Rabin Adhikari
Manish Dhakal
Safal Thapaliya
K. Poudel
Prasiddha Bhandari
Bishesh Khanal
25
7
0
22 Sep 2023
Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models
Yangyi Chen
Karan Sikka
Michael Cogswell
Heng Ji
Ajay Divakaran
LRM
36
24
0
08 Sep 2023
Parameter and Computation Efficient Transfer Learning for Vision-Language Pre-trained Models
Qiong Wu
Wei Yu
Yiyi Zhou
Shubin Huang
Xiaoshuai Sun
Rongrong Ji
VLM
26
7
0
04 Sep 2023
RenAIssance: A Survey into AI Text-to-Image Generation in the Era of Large Model
Fengxiang Bie
Yibo Yang
Zhongzhu Zhou
Adam Ghanem
Minjia Zhang
...
Pareesa Ameneh Golnari
David A. Clifton
Yuxiong He
Dacheng Tao
Shuaiwen Leon Song
EGVM
33
19
0
02 Sep 2023
ViLTA: Enhancing Vision-Language Pre-training through Textual Augmentation
Weihan Wang
Zhengyuan Yang
Bin Xu
Juanzi Li
Yankui Sun
VLM
28
8
0
31 Aug 2023
Artificial-Spiking Hierarchical Networks for Vision-Language Representation Learning
Ye-Ting Chen
Siyu Zhang
Yaoru Sun
Weijian Liang
Haoran Wang
40
0
0
18 Aug 2023
Exploring Transfer Learning in Medical Image Segmentation using Vision-Language Models
K. Poudel
Manish Dhakal
Prasiddha Bhandari
Rabin Adhikari
Safal Thapaliya
Bishesh Khanal
VLM
30
17
0
15 Aug 2023
LOIS: Looking Out of Instance Semantics for Visual Question Answering
Siyu Zhang
Ye Chen
Yaoru Sun
Fang Wang
Haibo Shi
Haoran Wang
25
4
0
26 Jul 2023
A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future
Chaoyang Zhu
Long Chen
ObjD
VLM
31
32
0
18 Jul 2023
BUS:Efficient and Effective Vision-language Pre-training with Bottom-Up Patch Summarization
Chaoya Jiang
Haiyang Xu
Wei Ye
Qinghao Ye
Chenliang Li
Mingshi Yan
Bin Bi
Shikun Zhang
Fei Huang
Songfang Huang
VLM
31
9
0
17 Jul 2023
PAT: Parallel Attention Transformer for Visual Question Answering in Vietnamese
Nghia Hieu Nguyen
Kiet Van Nguyen
13
2
0
17 Jul 2023
SINC: Self-Supervised In-Context Learning for Vision-Language Tasks
Yi-Syuan Chen
Yun-Zhu Song
Cheng Yu Yeo
Bei Liu
Jianlong Fu
Hong-Han Shuai
VLM
LRM
26
4
0
15 Jul 2023
Bootstrapping Vision-Language Learning with Decoupled Language Pre-training
Yiren Jian
Chongyang Gao
Soroush Vosoughi
VLM
MLLM
32
25
0
13 Jul 2023
One-Versus-Others Attention: Scalable Multimodal Integration for Clinical Data
Michal Golovanevsky
Eva Schiller
Akira Nair
Ritambhara Singh
Carsten Eickhoff
19
2
0
11 Jul 2023
Vision Language Transformers: A Survey
Clayton Fields
C. Kennington
VLM
28
5
0
06 Jul 2023
C
2
\mathbf{C}^2
C
2
Former: Calibrated and Complementary Transformer for RGB-Infrared Object Detection
Maoxun Yuan
Xingxing Wei
ViT
25
38
0
28 Jun 2023
Towards Open Vocabulary Learning: A Survey
Jianzong Wu
Xiangtai Li
Shilin Xu
Haobo Yuan
Henghui Ding
...
Jiangning Zhang
Yu Tong
Xudong Jiang
Guohao Li
Dacheng Tao
ObjD
VLM
34
136
0
28 Jun 2023
Approximated Prompt Tuning for Vision-Language Pre-trained Models
Qiong Wu
Shubin Huang
Yiyi Zhou
Pingyang Dai
Annan Shu
Guannan Jiang
Rongrong Ji
VLM
VPVLM
25
2
0
27 Jun 2023
Switch-BERT: Learning to Model Multimodal Interactions by Switching Attention and Input
Qingpei Guo
Kaisheng Yao
Wei Chu
MLLM
22
4
0
25 Jun 2023
Efficient Token-Guided Image-Text Retrieval with Consistent Multimodal Contrastive Training
Chong Liu
Yuqi Zhang
Hongsong Wang
Weihua Chen
F. Wang
Yan Huang
Yixing Shen
Liang Wang
19
25
0
15 Jun 2023
Image Captioners Are Scalable Vision Learners Too
Michael Tschannen
Manoj Kumar
Andreas Steiner
Xiaohua Zhai
N. Houlsby
Lucas Beyer
VLM
CLIP
26
53
0
13 Jun 2023
Global and Local Semantic Completion Learning for Vision-Language Pre-training
Rong-Cheng Tu
Yatai Ji
Jie Jiang
Weijie Kong
Chengfei Cai
Wenzhe Zhao
Hongfa Wang
Yujiu Yang
Wei Liu
VLM
24
2
0
12 Jun 2023
A Comprehensive Survey on Applications of Transformers for Deep Learning Tasks
Saidul Islam
Hanae Elmekki
Ahmed Elsebai
Jamal Bentahar
Najat Drawel
Gaith Rjoub
Witold Pedrycz
ViT
MedIm
24
171
0
11 Jun 2023
Adapting Pre-trained Language Models to Vision-Language Tasks via Dynamic Visual Prompting
Shubin Huang
Qiong Wu
Yiyi Zhou
Weijie Chen
Rongsheng Zhang
Xiaoshuai Sun
Rongrong Ji
VLM
VPVLM
LRM
16
0
0
01 Jun 2023
BIG-C: a Multimodal Multi-Purpose Dataset for Bemba
Claytone Sikasote
Eunice Mukonde
Md Mahfuz Ibn Alam
Antonios Anastasopoulos
28
6
0
26 May 2023
MEMEX: Detecting Explanatory Evidence for Memes via Knowledge-Enriched Contextualization
Shivam Sharma
S Ramaneswaran
Udit Arora
Md. Shad Akhtar
Tanmoy Chakraborty
32
9
0
25 May 2023
UniChart: A Universal Vision-language Pretrained Model for Chart Comprehension and Reasoning
Ahmed Masry
P. Kavehzadeh
Do Xuan Long
Enamul Hoque
Chenyu You
LRM
27
100
0
24 May 2023
Training Transitive and Commutative Multimodal Transformers with LoReTTa
Manuel Tran
Yashin Dicente Cid
Amal Lahiani
Fabian J. Theis
Tingying Peng
Eldad Klaiman
26
2
0
23 May 2023
UNIMO-3: Multi-granularity Interaction for Vision-Language Representation Learning
Hao Yang
Can Gao
Hao Liu
Xinyan Xiao
Yanyan Zhao
Bing Qin
28
2
0
23 May 2023
VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending
Xingjian He
Sihan Chen
Fan Ma
Zhicheng Huang
Xiaojie Jin
Zikang Liu
Dongmei Fu
Yi Yang
Jiaheng Liu
Jiashi Feng
VLM
CLIP
23
17
0
22 May 2023
Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner
Zikang Liu
Sihan Chen
Longteng Guo
Handong Li
Xingjian He
Jiaheng Liu
15
1
0
19 May 2023
Rethinking Multimodal Content Moderation from an Asymmetric Angle with Mixed-modality
Jialing Yuan
Ye Yu
Gaurav Mittal
Matthew Hall
Sandra Sajeev
Mei Chen
19
9
0
17 May 2023
ChatGPT-Like Large-Scale Foundation Models for Prognostics and Health Management: A Survey and Roadmaps
Yanfang Li
Huan Wang
Muxia Sun
LM&MA
AI4TS
AI4CE
29
46
0
10 May 2023
Vision Language Pre-training by Contrastive Learning with Cross-Modal Similarity Regulation
Chaoya Jiang
Wei Ye
Haiyang Xu
Miang yan
Shikun Zhang
Jie Zhang
Fei Huang
VLM
34
15
0
08 May 2023
OpenViVQA: Task, Dataset, and Multimodal Fusion Models for Visual Question Answering in Vietnamese
Nghia Hieu Nguyen
Duong T.D. Vo
Kiet Van Nguyen
Ngan Luu-Thuy Nguyen
26
18
0
07 May 2023
Click-Feedback Retrieval
Zeyu Wang
Yuehua Wu
27
0
0
28 Apr 2023
Hypernymization of named entity-rich captions for grounding-based multi-modal pretraining
Giacomo Nebbia
Adriana Kovashka
19
0
0
25 Apr 2023
Movie Box Office Prediction With Self-Supervised and Visually Grounded Pretraining
Qin Chao
Eunsoo Kim
Boyang Albert Li
21
1
0
20 Apr 2023
Is Cross-modal Information Retrieval Possible without Training?
Hyunjin Choi
HyunJae Lee
Seongho Joe
Youngjune Gwon
17
0
0
20 Apr 2023
MoMo: A shared encoder Model for text, image and multi-Modal representations
Rakesh Chada
Zhao-Heng Zheng
P. Natarajan
ViT
19
4
0
11 Apr 2023
Self-Supervised Multimodal Learning: A Survey
Yongshuo Zong
Oisin Mac Aodha
Timothy M. Hospedales
SSL
24
43
0
31 Mar 2023
Previous
1
2
3
4
5
6
Next