ResearchTrend.AI
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

6 August 2019
Jiasen Lu
Dhruv Batra
Devi Parikh
Stefan Lee
    SSL
    VLM

Papers citing "ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks"

50 / 2,093 papers shown
History for Visual Dialog: Do we really need it?
Shubham Agarwal
Trung Bui
Joon-Young Lee
Ioannis Konstas
Verena Rieser
VLM
19
69
0
08 May 2020
MISA: Modality-Invariant and -Specific Representations for Multimodal Sentiment Analysis
Devamanyu Hazarika
Roger Zimmermann
Soujanya Poria
21
675
0
07 May 2020
Cross-media Structured Common Space for Multimedia Event Extraction
Manling Li
Alireza Zareian
Qi Zeng
Spencer Whitehead
Di Lu
Heng Ji
Shih-Fu Chang
10
103
0
05 May 2020
Words aren't enough, their order matters: On the Robustness of Grounding Visual Referring Expressions
Arjun Reddy Akula
Spandana Gella
Yaser Al-Onaizan
Song-Chun Zhu
Siva Reddy
ObjD
26
52
0
04 May 2020
Visually Grounded Continual Learning of Compositional Phrases
Xisen Jin
Junyi Du
Arka Sadhu
Ram Nevatia
Xiang Ren
CLL
14
4
0
02 May 2020
Probing Contextual Language Models for Common Ground with Visual Representations
Gabriel Ilharco
Rowan Zellers
Ali Farhadi
Hannaneh Hajishirzi
30
14
0
01 May 2020
Visuo-Linguistic Question Answering (VLQA) Challenge
Shailaja Keyur Sampat
Yezhou Yang
Chitta Baral
CoGe
13
1
0
01 May 2020
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training
Linjie Li
Yen-Chun Chen
Yu Cheng
Zhe Gan
Licheng Yu
Jingjing Liu
MLLM
VLM
OffRL
AI4TS
46
493
0
01 May 2020
Crisscrossed Captions: Extended Intramodal and Intermodal Semantic Similarity Judgments for MS-COCO
Zarana Parekh
Jason Baldridge
Daniel Cer
Austin Waters
Yinfei Yang
19
61
0
30 Apr 2020
Improving Vision-and-Language Navigation with Image-Text Pairs from the Web
Arjun Majumdar
Ayush Shrivastava
Stefan Lee
Peter Anderson
Devi Parikh
Dhruv Batra
LM&Ro
47
230
0
30 Apr 2020
Span-based Localizing Network for Natural Language Video Localization
Hao Zhang
Aixin Sun
Wei Jing
Qiufeng Wang
32
312
0
29 Apr 2020
Heterogeneous Representation Learning: A Review
Qiufeng Wang
Xi Peng
Yew-Soon Ong
6
0
0
28 Apr 2020
VD-BERT: A Unified Vision and Dialog Transformer with BERT
Yue Wang
Chenyu You
Michael R. Lyu
Irwin King
Caiming Xiong
Guosheng Lin
24
102
0
28 Apr 2020
Deep Multimodal Neural Architecture Search
Zhou Yu
Yuhao Cui
Jun-chen Yu
Meng Wang
Dacheng Tao
Qi Tian
16
98
0
25 Apr 2020
VisualCOMET: Reasoning about the Dynamic Context of a Still Image
J. S. Park
Chandra Bhagavatula
Roozbeh Mottaghi
Ali Farhadi
Yejin Choi
ReLM
LRM
27
6
0
22 Apr 2020
Experience Grounds Language
Yonatan Bisk
Ari Holtzman
Jesse Thomason
Jacob Andreas
Yoshua Bengio
...
Angeliki Lazaridou
Jonathan May
Aleksandr Nisnevich
Nicolas Pinto
Joseph P. Turian
21
351
0
21 Apr 2020
Transformer Reasoning Network for Image-Text Matching and Retrieval
Nicola Messina
Fabrizio Falchi
Andrea Esuli
Giuseppe Amato
ViT
30
58
0
20 Apr 2020
Are we pretraining it right? Digging deeper into visio-linguistic pretraining
Amanpreet Singh
Vedanuj Goswami
Devi Parikh
VLM
40
48
0
19 Apr 2020
lamBERT: Language and Action Learning Using Multimodal BERT
Kazuki Miyazawa
Tatsuya Aoki
Takato Horii
Takayuki Nagai
SSL
LM&Ro
21
12
0
15 Apr 2020
Coreferential Reasoning Learning for Language Representation
Deming Ye
Yankai Lin
Jiaju Du
Zhenghao Liu
Peng Li
Maosong Sun
Zhiyuan Liu
34
177
0
15 Apr 2020
Relation Transformer Network
Rajat Koner
Poulami Sinhamahapatra
Volker Tresp
ViT
21
32
0
13 Apr 2020
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
Xiujun Li
Xi Yin
Chunyuan Li
Pengchuan Zhang
Xiaowei Hu
...
Houdong Hu
Li Dong
Furu Wei
Yejin Choi
Jianfeng Gao
VLM
17
1,917
0
13 Apr 2020
An Entropy Clustering Approach for Assessing Visual Question Difficulty
K. Terao
Toru Tamaki
B. Raytchev
K. Kaneda
Shuníchi Satoh
OOD
AAML
26
1
0
12 Apr 2020
Rephrasing visual questions by specifying the entropy of the answer distribution
K. Terao
Toru Tamaki
B. Raytchev
K. Kaneda
S. Satoh
OOD
24
2
0
10 Apr 2020
Multimodal Categorization of Crisis Events in Social Media
Mahdi Abavisani
Liwei Wu
Shengli Hu
Joel R. Tetreault
A. Jaimes
29
87
0
10 Apr 2020
Learning to Scale Multilingual Representations for Vision-Language Tasks
Andrea Burns
Donghyun Kim
Derry Wijaya
Kate Saenko
Bryan A. Plummer
15
35
0
09 Apr 2020
Context-Aware Group Captioning via Self-Attention and Contrastive Features
Zhuowan Li
Quan Hung Tran
Long Mai
Zhe-nan Lin
Alan Yuille
VLM
14
44
0
07 Apr 2020
TAPAS: Weakly Supervised Table Parsing via Pre-training
Jonathan Herzig
Pawel Krzysztof Nowak
Thomas Müller
Francesco Piccinno
Julian Martin Eisenschlos
LMTD
RALM
45
634
0
05 Apr 2020
Generating Rationales in Visual Question Answering
Hammad A. Ayyubi
Md. Mehrab Tanjim
Julian McAuley
G. Cottrell
LRM
22
5
0
04 Apr 2020
XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation
Yaobo Liang
Nan Duan
Yeyun Gong
Ning Wu
Fenfei Guo
...
Shuguang Liu
Fan Yang
Daniel Fernando Campos
Rangan Majumder
Ming Zhou
ELM
VLM
63
342
0
03 Apr 2020
Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers
Zhicheng Huang
Zhaoyang Zeng
Bei Liu
Dongmei Fu
Jianlong Fu
ViT
50
436
0
02 Apr 2020
VIOLIN: A Large-Scale Dataset for Video-and-Language Inference
J. Liu
Wenhu Chen
Yu Cheng
Zhe Gan
Licheng Yu
Yiming Yang
Jingjing Liu
MLLM
VGen
43
68
0
25 Mar 2020
Pre-trained Models for Natural Language Processing: A Survey
Xipeng Qiu
Tianxiang Sun
Yige Xu
Yunfan Shao
Ning Dai
Xuanjing Huang
LM&MA
VLM
243
1,452
0
18 Mar 2020
Deconfounded Image Captioning: A Causal Retrospect
Xu Yang
Hanwang Zhang
Jianfei Cai
CML
12
118
0
09 Mar 2020
Cross-modal Learning for Multi-modal Video Categorization
Palash Goyal
Saurabh Sahu
Shalini Ghosh
Chul Lee
13
8
0
07 Mar 2020
XGPT: Cross-modal Generative Pre-Training for Image Captioning
Qiaolin Xia
Haoyang Huang
Nan Duan
Dongdong Zhang
Lei Ji
Zhifang Sui
Edward Cui
Taroon Bharti
Xin Liu
Ming Zhou
MLLM
VLM
25
74
0
03 Mar 2020
Visual Commonsense R-CNN
Tan Wang
Jianqiang Huang
Hanwang Zhang
Qianru Sun
SSL
ObjD
CML
18
245
0
27 Feb 2020
What BERT Sees: Cross-Modal Transfer for Visual Question Generation
Thomas Scialom
Patrick Bordes
Paul-Alexis Dray
Jacopo Staiano
Patrick Gallinari
25
6
0
25 Feb 2020
Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training
Weituo Hao
Chunyuan Li
Xiujun Li
Lawrence Carin
Jianfeng Gao
LM&Ro
18
274
0
25 Feb 2020
Measuring Social Biases in Grounded Vision and Language Embeddings
Candace Ross
Boris Katz
Andrei Barbu
19
63
0
20 Feb 2020
Contextual Lensing of Universal Sentence Representations
J. Kiros
15
5
0
20 Feb 2020
VQA-LOL: Visual Question Answering under the Lens of Logic
Tejas Gokhale
Pratyay Banerjee
Chitta Baral
Yezhou Yang
CoGe
25
73
0
19 Feb 2020
CodeBERT: A Pre-Trained Model for Programming and Natural Languages
Zhangyin Feng
Daya Guo
Duyu Tang
Nan Duan
Xiaocheng Feng
...
Linjun Shou
Bing Qin
Ting Liu
Daxin Jiang
Ming Zhou
68
2,533
0
19 Feb 2020
UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation
Huaishao Luo
Lei Ji
Botian Shi
Haoyang Huang
Nan Duan
Tianrui Li
Jason Li
Xilin Chen
Ming Zhou
VLM
46
439
0
15 Feb 2020
Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping
Jesse Dodge
Gabriel Ilharco
Roy Schwartz
Ali Farhadi
Hannaneh Hajishirzi
Noah A. Smith
41
584
0
15 Feb 2020
Exploiting Temporal Coherence for Multi-modal Video Categorization
Palash Goyal
Saurabh Sahu
Shalini Ghosh
Chul Lee
20
1
0
07 Feb 2020
Retrospective Reader for Machine Reading Comprehension
Zhuosheng Zhang
Junjie Yang
Hai Zhao
RALM
25
226
0
27 Jan 2020
ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data
Di Qi
Lin Su
Jianwei Song
Edward Cui
Taroon Bharti
Arun Sacheti
VLM
40
259
0
22 Jan 2020
Accuracy vs. Complexity: A Trade-off in Visual Question Answering Models
M. Farazi
Salman H. Khan
Nick Barnes
23
17
0
20 Jan 2020
In Defense of Grid Features for Visual Question Answering
Huaizu Jiang
Ishan Misra
Marcus Rohrbach
Erik Learned-Miller
Xinlei Chen
OOD
ObjD
23
318
0
10 Jan 2020