1908.02265
Cited By
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
6 August 2019
Jiasen Lu
Dhruv Batra
Devi Parikh
Stefan Lee
SSL
VLM
Papers citing
"ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks"
50 / 2,119 papers shown
CLIP2TV: Align, Match and Distill for Video-Text Retrieval
Zijian Gao
Qingbin Liu
Weiqi Sun
S. Chen
Dedan Chang
Lili Zhao
VLM
CLIP
65
17
0
10 Nov 2021
LUMINOUS: Indoor Scene Generation for Embodied AI Challenges
Yizhou Zhao
Kaixiang Lin
Zhiwei Jia
Qiaozi Gao
Govind Thattai
Jesse Thomason
Gaurav Sukhatme
3DV
LM&Ro
56
16
0
10 Nov 2021
FILIP: Fine-grained Interactive Language-Image Pre-Training
Lewei Yao
Runhu Huang
Lu Hou
Guansong Lu
Minzhe Niu
Hang Xu
Xiaodan Liang
Zhenguo Li
Xin Jiang
Chunjing Xu
VLM
CLIP
113
644
0
09 Nov 2021
A Survey on Green Deep Learning
Jingjing Xu
Wangchunshu Zhou
Zhiyi Fu
Hao Zhou
Lei Li
VLM
203
84
0
08 Nov 2021
NarrationBot and InfoBot: A Hybrid System for Automated Video Description
Shasta Ihorn
Y. Siu
Aditya Bodi
Lothar D Narins
Jose M. Castanon
Yash Kant
Abhishek Das
Ilmi Yoon
Pooyan Fazli
46
3
0
07 Nov 2021
Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling
Renrui Zhang
Rongyao Fang
Wei Zhang
Peng Gao
Kunchang Li
Jifeng Dai
Yu Qiao
Hongsheng Li
VLM
292
405
0
06 Nov 2021
The Curious Layperson: Fine-Grained Image Recognition without Expert Labels
Subhabrata Choudhury
Iro Laina
Christian Rupprecht
Andrea Vedaldi
VLM
80
10
0
05 Nov 2021
An Empirical Study of Training End-to-End Vision-and-Language Transformers
Zi-Yi Dou
Yichong Xu
Zhe Gan
Jianfeng Wang
Shuohang Wang
...
Pengchuan Zhang
Lu Yuan
Nanyun Peng
Zicheng Liu
Michael Zeng
VLM
106
381
0
03 Nov 2021
VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts
Hangbo Bao
Wenhui Wang
Li Dong
Qiang Liu
Owais Khan Mohammed
Kriti Aggarwal
Subhojit Som
Furu Wei
VLM
MLLM
MoE
104
560
0
03 Nov 2021
Revisiting spatio-temporal layouts for compositional action recognition
Gorjan Radevski
Marie-Francine Moens
Tinne Tuytelaars
104
26
0
02 Nov 2021
Masking Modalities for Cross-modal Video Retrieval
Valentin Gabeur
Arsha Nagrani
Chen Sun
Alahari Karteek
Cordelia Schmid
88
30
0
01 Nov 2021
With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition
Evangelos Kazakos
Jaesung Huh
Arsha Nagrani
Andrew Zisserman
Dima Damen
EgoV
122
46
0
01 Nov 2021
Cross-Modality Fusion Transformer for Multispectral Object Detection
Q. Fang
D. Han
Zhaokui Wang
ViT
101
157
0
30 Oct 2021
Fusing ASR Outputs in Joint Training for Speech Emotion Recognition
Yuanchao Li
P. Bell
Catherine Lai
93
58
0
29 Oct 2021
Pay attention to emoji: Feature Fusion Network with EmoGraph2vec Model for Sentiment Analysis
Xiaowei Yuan
Jingyuan Hu
Xiaodan Zhang
Honglei Lv
GNN
31
4
0
27 Oct 2021
TriBERT: Full-body Human-centric Audio-visual Representation Learning for Visual Sound Separation
Tanzila Rahman
Mengyu Yang
Leonid Sigal
ViT
75
8
0
26 Oct 2021
History Aware Multimodal Transformer for Vision-and-Language Navigation
Shizhe Chen
Pierre-Louis Guhur
Cordelia Schmid
Ivan Laptev
LM&Ro
88
236
0
25 Oct 2021
IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning
Pan Lu
Liang Qiu
Jiaqi Chen
Tony Xia
Yizhou Zhao
Wei Zhang
Zhou Yu
Xiaodan Liang
Song-Chun Zhu
AIMat
173
206
0
25 Oct 2021
Alignment Attention by Matching Key and Query Distributions
Shujian Zhang
Xinjie Fan
Huangjie Zheng
Korawat Tanwisuth
Mingyuan Zhou
OOD
122
10
0
25 Oct 2021
Multimodal Learning using Optimal Transport for Sarcasm and Humor Detection
Shraman Pramanick
A. Roy
Vishal M. Patel
82
58
0
21 Oct 2021
Text-Based Person Search with Limited Data
Xiaoping Han
Sen He
Li Zhang
Tao Xiang
94
91
0
20 Oct 2021
VLDeformer: Vision-Language Decomposed Transformer for Fast Cross-Modal Retrieval
Lisai Zhang
Hongfa Wu
Qingcai Chen
Yimeng Deng
Zhonghua Li
Dejiang Kong
Bo Zhao
Joanna Siebert
Yunpeng Han
ViT
VLM
100
21
0
20 Oct 2021
StructFormer: Learning Spatial Structure for Language-Guided Semantic Rearrangement of Novel Objects
Weiyu Liu
Chris Paxton
Tucker Hermans
Dieter Fox
109
94
0
19 Oct 2021
Unifying Multimodal Transformer for Bi-directional Image and Text Generation
Yupan Huang
Hongwei Xue
Bei Liu
Yutong Lu
84
59
0
19 Oct 2021
Self-Supervised Representation Learning: Introduction, Advances and Challenges
Linus Ericsson
Henry Gouk
Chen Change Loy
Timothy M. Hospedales
SSL
OOD
AI4TS
96
281
0
18 Oct 2021
Understanding Multimodal Procedural Knowledge by Sequencing Multimodal Instructional Manuals
Te-Lin Wu
Alexander Spangher
Pegah Alipoormolabashi
Marjorie Freedman
R. Weischedel
Nanyun Peng
78
23
0
16 Oct 2021
Unsupervised Natural Language Inference Using PHL Triplet Generation
Neeraj Varshney
Pratyay Banerjee
Tejas Gokhale
Chitta Baral
78
9
0
16 Oct 2021
On Learning the Transformer Kernel
Sankalan Pal Chowdhury
Adamos Solomou
Kumar Avinava Dubey
Mrinmaya Sachan
ViT
131
14
0
15 Oct 2021
Few-Shot Bot: Prompt-Based Learning for Dialogue Systems
Andrea Madotto
Zhaojiang Lin
Genta Indra Winata
Pascale Fung
95
85
0
15 Oct 2021
Semantically Distributed Robust Optimization for Vision-and-Language Inference
Tejas Gokhale
A. Chaudhary
Pratyay Banerjee
Chitta Baral
Yezhou Yang
126
17
0
14 Oct 2021
Understanding of Emotion Perception from Art
Digbalay Bose
Krishna Somandepalli
Souvik Kundu
Rimita Lahiri
Jonathan Gratch
Shrikanth Narayanan
29
5
0
13 Oct 2021
MMIU: Dataset for Visual Intent Understanding in Multimodal Assistants
Alkesh Patel
Joel Ruben Antony Moniz
R. Nguyen
Nicholas Tzou
Hadas Kotek
Vincent Renkens
VGen
32
1
0
13 Oct 2021
ALL Dolphins Are Intelligent and SOME Are Friendly: Probing BERT for Nouns' Semantic Properties and their Prototypicality
Marianna Apidianaki
Aina Garí Soler
78
18
0
12 Oct 2021
Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm
Yangguang Li
Feng Liang
Lichen Zhao
Yufeng Cui
Wanli Ouyang
Jing Shao
F. Yu
Junjie Yan
VLM
CLIP
167
458
0
11 Oct 2021
Pano-AVQA: Grounded Audio-Visual Question Answering on 360° Videos
Heeseung Yun
Youngjae Yu
Wonsuk Yang
Kangil Lee
Gunhee Kim
100
86
0
11 Oct 2021
CLIP-Adapter: Better Vision-Language Models with Feature Adapters
Peng Gao
Shijie Geng
Renrui Zhang
Teli Ma
Rongyao Fang
Yongfeng Zhang
Hongsheng Li
Yu Qiao
VLM
CLIP
355
1,066
0
09 Oct 2021
Is An Image Worth Five Sentences? A New Look into Semantics for Image-Text Matching
Ali Furkan Biten
Andrés Mafla
Lluís Gómez
Dimosthenis Karatzas
257
18
0
06 Oct 2021
Efficient Multi-Modal Embeddings from Structured Data
A. Vero
Ann A. Copestake
35
4
0
06 Oct 2021
Word Acquisition in Neural Language Models
Tyler A. Chang
Benjamin Bergen
90
40
0
05 Oct 2021
A Survey On Neural Word Embeddings
Erhan Sezerer
Selma Tekir
AI4TS
86
13
0
05 Oct 2021
ProTo: Program-Guided Transformer for Program-Guided Tasks
Zelin Zhao
Karan Samel
Binghong Chen
Le Song
ViT
LM&Ro
98
30
0
02 Oct 2021
Visually Grounded Concept Composition
Bowen Zhang
Hexiang Hu
Linlu Qiu
Peter Shaw
Fei Sha
CoGe
122
6
0
29 Sep 2021
Visually Grounded Reasoning across Languages and Cultures
Fangyu Liu
Emanuele Bugliarello
Edoardo Ponti
Siva Reddy
Nigel Collier
Desmond Elliott
VLM
LRM
175
180
0
28 Sep 2021
Audio-to-Image Cross-Modal Generation
Maciej Żelaszczyk
Jacek Mańdziuk
DiffM
118
17
0
27 Sep 2021
VQA-MHUG: A Gaze Dataset to Study Multimodal Neural Attention in Visual Question Answering
Ekta Sood
Fabian Kögel
Florian Strohm
Prajit Dhar
Andreas Bulling
67
19
0
27 Sep 2021
Why Do We Click: Visual Impression-aware News Recommendation
Jiahao Xun
Shengyu Zhang
Zhou Zhao
Jieming Zhu
Qi Zhang
Jingjie Li
Xiuqiang He
Xiaofei He
Tat-Seng Chua
Leilei Gan
155
33
0
26 Sep 2021
Systematic Generalization on gSCAN: What is Nearly Solved and What is Next?
Linlu Qiu
Hexiang Hu
Bowen Zhang
Peter Shaw
Fei Sha
86
21
0
25 Sep 2021
MLIM: Vision-and-Language Model Pre-training with Masked Language and Image Modeling
Tarik Arici
M. S. Seyfioglu
T. Neiman
Yi Tian Xu
Son N. Tran
Trishul Chilimbi
Belinda Zeng
Ismail B. Tutar
VLM
67
15
0
24 Sep 2021
CLIPort: What and Where Pathways for Robotic Manipulation
Mohit Shridhar
Lucas Manuelli
Dieter Fox
LM&Ro
165
661
0
24 Sep 2021
Detecting Harmful Memes and Their Targets
Shraman Pramanick
Dimitar Dimitrov
Rituparna Mukherjee
Shivam Sharma
Md. Shad Akhtar
Preslav Nakov
Tanmoy Chakraborty
84
117
0
24 Sep 2021