ResearchTrend.AI
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

6 August 2019
Jiasen Lu
Dhruv Batra
Devi Parikh
Stefan Lee
    SSL VLM

Papers citing "ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks"

50 / 2,119 papers shown
CLIP2TV: Align, Match and Distill for Video-Text Retrieval
Zijian Gao
Qingbin Liu
Weiqi Sun
S. Chen
Dedan Chang
Lili Zhao
VLM CLIP
65
17
0
10 Nov 2021
LUMINOUS: Indoor Scene Generation for Embodied AI Challenges
Yizhou Zhao
Kaixiang Lin
Zhiwei Jia
Qiaozi Gao
Govind Thattai
Jesse Thomason
Gaurav Sukhatme
3DV LM&Ro
56
16
0
10 Nov 2021
FILIP: Fine-grained Interactive Language-Image Pre-Training
Lewei Yao
Runhu Huang
Lu Hou
Guansong Lu
Minzhe Niu
Hang Xu
Xiaodan Liang
Zhenguo Li
Xin Jiang
Chunjing Xu
VLM CLIP
113
644
0
09 Nov 2021
A Survey on Green Deep Learning
Jingjing Xu
Wangchunshu Zhou
Zhiyi Fu
Hao Zhou
Lei Li
VLM
203
84
0
08 Nov 2021
NarrationBot and InfoBot: A Hybrid System for Automated Video Description
Shasta Ihorn
Y. Siu
Aditya Bodi
Lothar D Narins
Jose M. Castanon
Yash Kant
Abhishek Das
Ilmi Yoon
Pooyan Fazli
46
3
0
07 Nov 2021
Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling
Renrui Zhang
Rongyao Fang
Wei Zhang
Peng Gao
Kunchang Li
Jifeng Dai
Yu Qiao
Hongsheng Li
VLM
292
405
0
06 Nov 2021
The Curious Layperson: Fine-Grained Image Recognition without Expert Labels
Subhabrata Choudhury
Iro Laina
Christian Rupprecht
Andrea Vedaldi
VLM
80
10
0
05 Nov 2021
An Empirical Study of Training End-to-End Vision-and-Language Transformers
Zi-Yi Dou
Yichong Xu
Zhe Gan
Jianfeng Wang
Shuohang Wang
...
Pengchuan Zhang
Lu Yuan
Nanyun Peng
Zicheng Liu
Michael Zeng
VLM
106
381
0
03 Nov 2021
VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts
Hangbo Bao
Wenhui Wang
Li Dong
Qiang Liu
Owais Khan Mohammed
Kriti Aggarwal
Subhojit Som
Furu Wei
VLM MLLM MoE
104
560
0
03 Nov 2021
Revisiting spatio-temporal layouts for compositional action recognition
Gorjan Radevski
Marie-Francine Moens
Tinne Tuytelaars
104
26
0
02 Nov 2021
Masking Modalities for Cross-modal Video Retrieval
Valentin Gabeur
Arsha Nagrani
Chen Sun
Alahari Karteek
Cordelia Schmid
88
30
0
01 Nov 2021
With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition
Evangelos Kazakos
Jaesung Huh
Arsha Nagrani
Andrew Zisserman
Dima Damen
EgoV
122
46
0
01 Nov 2021
Cross-Modality Fusion Transformer for Multispectral Object Detection
Q. Fang
D. Han
Zhaokui Wang
ViT
101
157
0
30 Oct 2021
Fusing ASR Outputs in Joint Training for Speech Emotion Recognition
Yuanchao Li
P. Bell
Catherine Lai
93
58
0
29 Oct 2021
Pay attention to emoji: Feature Fusion Network with EmoGraph2vec Model for Sentiment Analysis
Xiaowei Yuan
Jingyuan Hu
Xiaodan Zhang
Honglei Lv
GNN
31
4
0
27 Oct 2021
TriBERT: Full-body Human-centric Audio-visual Representation Learning for Visual Sound Separation
Tanzila Rahman
Mengyu Yang
Leonid Sigal
ViT
75
8
0
26 Oct 2021
History Aware Multimodal Transformer for Vision-and-Language Navigation
Shizhe Chen
Pierre-Louis Guhur
Cordelia Schmid
Ivan Laptev
LM&Ro
88
236
0
25 Oct 2021
IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning
Pan Lu
Liang Qiu
Jiaqi Chen
Tony Xia
Yizhou Zhao
Wei Zhang
Zhou Yu
Xiaodan Liang
Song-Chun Zhu
AIMat
173
206
0
25 Oct 2021
Alignment Attention by Matching Key and Query Distributions
Shujian Zhang
Xinjie Fan
Huangjie Zheng
Korawat Tanwisuth
Mingyuan Zhou
OOD
122
10
0
25 Oct 2021
Multimodal Learning using Optimal Transport for Sarcasm and Humor Detection
Shraman Pramanick
A. Roy
Vishal M. Patel
82
58
0
21 Oct 2021
Text-Based Person Search with Limited Data
Xiaoping Han
Sen He
Li Zhang
Tao Xiang
94
91
0
20 Oct 2021
VLDeformer: Vision-Language Decomposed Transformer for Fast Cross-Modal Retrieval
Lisai Zhang
Hongfa Wu
Qingcai Chen
Yimeng Deng
Zhonghua Li
Dejiang Kong
Bo Zhao
Joanna Siebert
Yunpeng Han
ViT VLM
100
21
0
20 Oct 2021
StructFormer: Learning Spatial Structure for Language-Guided Semantic Rearrangement of Novel Objects
Weiyu Liu
Chris Paxton
Tucker Hermans
Dieter Fox
109
94
0
19 Oct 2021
Unifying Multimodal Transformer for Bi-directional Image and Text Generation
Yupan Huang
Hongwei Xue
Bei Liu
Yutong Lu
84
59
0
19 Oct 2021
Self-Supervised Representation Learning: Introduction, Advances and Challenges
Linus Ericsson
Henry Gouk
Chen Change Loy
Timothy M. Hospedales
SSL OOD AI4TS
96
281
0
18 Oct 2021
Understanding Multimodal Procedural Knowledge by Sequencing Multimodal Instructional Manuals
Te-Lin Wu
Alexander Spangher
Pegah Alipoormolabashi
Marjorie Freedman
R. Weischedel
Nanyun Peng
78
23
0
16 Oct 2021
Unsupervised Natural Language Inference Using PHL Triplet Generation
Neeraj Varshney
Pratyay Banerjee
Tejas Gokhale
Chitta Baral
78
9
0
16 Oct 2021
On Learning the Transformer Kernel
Sankalan Pal Chowdhury
Adamos Solomou
Kumar Avinava Dubey
Mrinmaya Sachan
ViT
131
14
0
15 Oct 2021
Few-Shot Bot: Prompt-Based Learning for Dialogue Systems
Andrea Madotto
Zhaojiang Lin
Genta Indra Winata
Pascale Fung
95
85
0
15 Oct 2021
Semantically Distributed Robust Optimization for Vision-and-Language Inference
Tejas Gokhale
A. Chaudhary
Pratyay Banerjee
Chitta Baral
Yezhou Yang
126
17
0
14 Oct 2021
Understanding of Emotion Perception from Art
Digbalay Bose
Krishna Somandepalli
Souvik Kundu
Rimita Lahiri
Jonathan Gratch
Shrikanth Narayanan
29
5
0
13 Oct 2021
MMIU: Dataset for Visual Intent Understanding in Multimodal Assistants
Alkesh Patel
Joel Ruben Antony Moniz
R. Nguyen
Nicholas Tzou
Hadas Kotek
Vincent Renkens
VGen
32
1
0
13 Oct 2021
ALL Dolphins Are Intelligent and SOME Are Friendly: Probing BERT for Nouns' Semantic Properties and their Prototypicality
Marianna Apidianaki
Aina Garí Soler
78
18
0
12 Oct 2021
Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm
Yangguang Li
Feng Liang
Lichen Zhao
Yufeng Cui
Wanli Ouyang
Jing Shao
F. Yu
Junjie Yan
VLM CLIP
167
458
0
11 Oct 2021
Pano-AVQA: Grounded Audio-Visual Question Answering on 360$^\circ$ Videos
Heeseung Yun
Youngjae Yu
Wonsuk Yang
Kangil Lee
Gunhee Kim
100
86
0
11 Oct 2021
CLIP-Adapter: Better Vision-Language Models with Feature Adapters
Peng Gao
Shijie Geng
Renrui Zhang
Teli Ma
Rongyao Fang
Yongfeng Zhang
Hongsheng Li
Yu Qiao
VLM CLIP
355
1,066
0
09 Oct 2021
Is An Image Worth Five Sentences? A New Look into Semantics for Image-Text Matching
Ali Furkan Biten
Andrés Mafla
Lluís Gómez
Dimosthenis Karatzas
257
18
0
06 Oct 2021
Efficient Multi-Modal Embeddings from Structured Data
A. Vero
Ann A. Copestake
35
4
0
06 Oct 2021
Word Acquisition in Neural Language Models
Tyler A. Chang
Benjamin Bergen
90
40
0
05 Oct 2021
A Survey On Neural Word Embeddings
Erhan Sezerer
Selma Tekir
AI4TS
86
13
0
05 Oct 2021
ProTo: Program-Guided Transformer for Program-Guided Tasks
Zelin Zhao
Karan Samel
Binghong Chen
Le Song
ViT LM&Ro
98
30
0
02 Oct 2021
Visually Grounded Concept Composition
Bowen Zhang
Hexiang Hu
Linlu Qiu
Peter Shaw
Fei Sha
CoGe
122
6
0
29 Sep 2021
Visually Grounded Reasoning across Languages and Cultures
Fangyu Liu
Emanuele Bugliarello
Edoardo Ponti
Siva Reddy
Nigel Collier
Desmond Elliott
VLM LRM
175
180
0
28 Sep 2021
Audio-to-Image Cross-Modal Generation
Maciej Żelaszczyk
Jacek Mańdziuk
DiffM
118
17
0
27 Sep 2021
VQA-MHUG: A Gaze Dataset to Study Multimodal Neural Attention in Visual Question Answering
Ekta Sood
Fabian Kögel
Florian Strohm
Prajit Dhar
Andreas Bulling
67
19
0
27 Sep 2021
Why Do We Click: Visual Impression-aware News Recommendation
Jiahao Xun
Shengyu Zhang
Zhou Zhao
Jieming Zhu
Qi Zhang
Jingjie Li
Xiuqiang He
Xiaofei He
Tat-Seng Chua
Leilei Gan
155
33
0
26 Sep 2021
Systematic Generalization on gSCAN: What is Nearly Solved and What is Next?
Linlu Qiu
Hexiang Hu
Bowen Zhang
Peter Shaw
Fei Sha
86
21
0
25 Sep 2021
MLIM: Vision-and-Language Model Pre-training with Masked Language and Image Modeling
Tarik Arici
M. S. Seyfioglu
T. Neiman
Yi Tian Xu
Son N. Tran
Trishul Chilimbi
Belinda Zeng
Ismail B. Tutar
VLM
67
15
0
24 Sep 2021
CLIPort: What and Where Pathways for Robotic Manipulation
Mohit Shridhar
Lucas Manuelli
Dieter Fox
LM&Ro
165
661
0
24 Sep 2021
Detecting Harmful Memes and Their Targets
Shraman Pramanick
Dimitar Dimitrov
Rituparna Mukherjee
Shivam Sharma
Md. Shad Akhtar
Preslav Nakov
Tanmoy Chakraborty
84
117
0
24 Sep 2021