ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 1908.02265
  4. Cited By
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for
  Vision-and-Language Tasks

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

6 August 2019
Jiasen Lu
Dhruv Batra
Devi Parikh
Stefan Lee
    SSLVLM
ArXiv (abs)PDFHTML

Papers citing "ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks"

50 / 2,119 papers shown
Title
Towards Unifying Reference Expression Generation and Comprehension
Towards Unifying Reference Expression Generation and Comprehension
Duo Zheng
Tao Kong
Ya Jing
Jiaan Wang
Xiaojie Wang
ObjD
64
6
0
24 Oct 2022
Visualizing the Obvious: A Concreteness-based Ensemble Model for Noun
  Property Prediction
Visualizing the Obvious: A Concreteness-based Ensemble Model for Noun Property Prediction
Yue Yang
Artemis Panagopoulou
Marianna Apidianaki
Mark Yatskar
Chris Callison-Burch
113
2
0
24 Oct 2022
When Can Transformers Ground and Compose: Insights from Compositional
  Generalization Benchmarks
When Can Transformers Ground and Compose: Insights from Compositional Generalization Benchmarks
Ankur Sikarwar
Arkil Patel
Navin Goyal
ViT
98
11
0
23 Oct 2022
McQueen: a Benchmark for Multimodal Conversational Query Rewrite
McQueen: a Benchmark for Multimodal Conversational Query Rewrite
Yifei Yuan
Chen Shi
Runze Wang
Liyi Chen
Feijun Jiang
Yuan You
W. Lam
41
6
0
23 Oct 2022
Z-LaVI: Zero-Shot Language Solver Fueled by Visual Imagination
Z-LaVI: Zero-Shot Language Solver Fueled by Visual Imagination
Yue Yang
Wenlin Yao
Hongming Zhang
Xiaoyang Wang
Dong Yu
Jianshu Chen
VLM
99
22
0
21 Oct 2022
SpaBERT: A Pretrained Language Model from Geographic Data for Geo-Entity
  Representation
SpaBERT: A Pretrained Language Model from Geographic Data for Geo-Entity Representation
Zekun Li
Jina Kim
Yao-Yi Chiang
Muhao Chen
133
32
0
21 Oct 2022
Do Vision-and-Language Transformers Learn Grounded Predicate-Noun
  Dependencies?
Do Vision-and-Language Transformers Learn Grounded Predicate-Noun Dependencies?
Mitja Nikolaus
Emmanuelle Salin
Stéphane Ayache
Abdellah Fourtassi
Benoit Favre
88
14
0
21 Oct 2022
Dissecting Deep Metric Learning Losses for Image-Text Retrieval
Dissecting Deep Metric Learning Losses for Image-Text Retrieval
Hong Xuan
Xi Chen
73
2
0
21 Oct 2022
Communication breakdown: On the low mutual intelligibility between human
  and neural captioning
Communication breakdown: On the low mutual intelligibility between human and neural captioning
Roberto Dessì
Eleonora Gualdoni
Francesca Franzon
Gemma Boleda
Marco Baroni
VLM
120
6
0
20 Oct 2022
Image-Text Retrieval with Binary and Continuous Label Supervision
Image-Text Retrieval with Binary and Continuous Label Supervision
Zheng Li
Caili Guo
Zerun Feng
Lei Li
Ying Jin
Yufeng Zhang
VLM
73
4
0
20 Oct 2022
Visual Spatial Description: Controlled Spatial-Oriented Image-to-Text
  Generation
Visual Spatial Description: Controlled Spatial-Oriented Image-to-Text Generation
Yu Zhao
Jianguo Wei
Zhichao Lin
Yueheng Sun
Meishan Zhang
Hao Fei
79
16
0
20 Oct 2022
Prompting through Prototype: A Prototype-based Prompt Learning on
  Pretrained Vision-Language Models
Prompting through Prototype: A Prototype-based Prompt Learning on Pretrained Vision-Language Models
Yue Zhang
Hongliang Fei
Dingcheng Li
Tan Yu
Ping Li
VPVLMVLM
71
9
0
19 Oct 2022
Grounded Video Situation Recognition
Grounded Video Situation Recognition
Zeeshan Khan
C. V. Jawahar
Makarand Tapaswi
102
14
0
19 Oct 2022
VTC: Improving Video-Text Retrieval with User Comments
VTC: Improving Video-Text Retrieval with User Comments
Laura Hanu
James Thewlis
Yuki M. Asano
Christian Rupprecht
VGen
123
8
0
19 Oct 2022
TOIST: Task Oriented Instance Segmentation Transformer with Noun-Pronoun
  Distillation
TOIST: Task Oriented Instance Segmentation Transformer with Noun-Pronoun Distillation
Pengfei Li
Beiwen Tian
Yongliang Shi
Xiaoxue Chen
Hao Zhao
Guyue Zhou
Ya Zhang
125
22
0
19 Oct 2022
LVP-M3: Language-aware Visual Prompt for Multilingual Multimodal Machine
  Translation
LVP-M3: Language-aware Visual Prompt for Multilingual Multimodal Machine Translation
Hongcheng Guo
Jiaheng Liu
Haoyang Huang
Jian Yang
Zhoujun Li
Dongdong Zhang
Zheng Cui
Furu Wei
93
22
0
19 Oct 2022
Dense but Efficient VideoQA for Intricate Compositional Reasoning
Dense but Efficient VideoQA for Intricate Compositional Reasoning
Jihyeon Janel Lee
Wooyoung Kang
Eun-Sol Kim
CoGe
59
4
0
19 Oct 2022
Entity-Focused Dense Passage Retrieval for Outside-Knowledge Visual
  Question Answering
Entity-Focused Dense Passage Retrieval for Outside-Knowledge Visual Question Answering
Jialin Wu
Raymond J. Mooney
RALM
140
11
0
18 Oct 2022
MedCLIP: Contrastive Learning from Unpaired Medical Images and Text
MedCLIP: Contrastive Learning from Unpaired Medical Images and Text
Zifeng Wang
Zhenbang Wu
Dinesh Agarwal
Jimeng Sun
CLIPVLMMedIm
138
436
0
18 Oct 2022
Probing Cross-modal Semantics Alignment Capability from the Textual
  Perspective
Probing Cross-modal Semantics Alignment Capability from the Textual Perspective
Zheng Ma
Shi Zong
Mianzhi Pan
Jianbing Zhang
Shujian Huang
Xinyu Dai
Jiajun Chen
61
4
0
18 Oct 2022
Contrastive Language-Image Pre-Training with Knowledge Graphs
Contrastive Language-Image Pre-Training with Knowledge Graphs
Xuran Pan
Tianzhu Ye
Dongchen Han
S. Song
Gao Huang
VLMCLIP
83
54
0
17 Oct 2022
Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models
  with Zero Training
Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training
A. M. H. Tiong
Junnan Li
Boyang Albert Li
Silvio Savarese
Guosheng Lin
MLLM
133
109
0
17 Oct 2022
COFAR: Commonsense and Factual Reasoning in Image Search
COFAR: Commonsense and Factual Reasoning in Image Search
Prajwal Gatti
A. S. Penamakuri
Revant Teotia
Anand Mishra
Shubhashis Sengupta
Roshni Ramnani
ReLMLRM
39
4
0
16 Oct 2022
EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge
  Distillation and Modal-adaptive Pruning
EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning
Tiannan Wang
Wangchunshu Zhou
Yan Zeng
Xinsong Zhang
VLM
82
44
0
14 Oct 2022
Plausible May Not Be Faithful: Probing Object Hallucination in
  Vision-Language Pre-training
Plausible May Not Be Faithful: Probing Object Hallucination in Vision-Language Pre-training
Wenliang Dai
Zihan Liu
Ziwei Ji
Jane Polak Scowcroft
Pascale Fung
MLLMVLM
101
67
0
14 Oct 2022
MMTSA: Multimodal Temporal Segment Attention Network for Efficient Human
  Activity Recognition
MMTSA: Multimodal Temporal Segment Attention Network for Efficient Human Activity Recognition
Ziqi Gao
Yuntao wang
Jianguo Chen
Junliang Xing
Shwetak N. Patel
Xin Liu
Yuanchun Shi
70
3
0
14 Oct 2022
DyLoRA: Parameter Efficient Tuning of Pre-trained Models using Dynamic
  Search-Free Low-Rank Adaptation
DyLoRA: Parameter Efficient Tuning of Pre-trained Models using Dynamic Search-Free Low-Rank Adaptation
Mojtaba Valipour
Mehdi Rezagholizadeh
I. Kobyzev
A. Ghodsi
170
185
0
14 Oct 2022
Can Language Representation Models Think in Bets?
Can Language Representation Models Think in Bets?
Zhi–Bin Tang
Mayank Kejriwal
55
6
0
14 Oct 2022
SQA3D: Situated Question Answering in 3D Scenes
SQA3D: Situated Question Answering in 3D Scenes
Xiaojian Ma
Silong Yong
Zilong Zheng
Qing Li
Yitao Liang
Song-Chun Zhu
Siyuan Huang
LM&Ro
97
160
0
14 Oct 2022
MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for
  Vision-Language Few-Shot Prompting
MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for Vision-Language Few-Shot Prompting
Oscar Manas
Pau Rodríguez López
Saba Ahmadi
Aida Nematzadeh
Yash Goyal
Aishwarya Agrawal
VLMVPVLM
65
51
0
13 Oct 2022
OpenCQA: Open-ended Question Answering with Charts
OpenCQA: Open-ended Question Answering with Charts
Shankar Kantharaj
Do Xuan Long
Rixie Tiffany Ko Leong
J. Tan
Enamul Hoque
Shafiq Joty
85
53
0
12 Oct 2022
One does not fit all! On the Complementarity of Vision Encoders for
  Vision and Language Tasks
One does not fit all! On the Complementarity of Vision Encoders for Vision and Language Tasks
Gregor Geigle
Chen Cecilia Liu
Jonas Pfeiffer
Iryna Gurevych
VLM
72
1
0
12 Oct 2022
ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich
  Document Understanding
ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding
Qiming Peng
Yinxu Pan
Wenjin Wang
Bin Luo
Zhenyu Zhang
...
Shi Feng
Yu Sun
Hao Tian
Hua Wu
Haifeng Wang
83
83
0
12 Oct 2022
Hate-CLIPper: Multimodal Hateful Meme Classification based on
  Cross-modal Interaction of CLIP Features
Hate-CLIPper: Multimodal Hateful Meme Classification based on Cross-modal Interaction of CLIP Features
Gokul Karthik Kumar
Karthik Nandakumar
VLMCLIP
97
66
0
12 Oct 2022
Understanding Embodied Reference with Touch-Line Transformer
Understanding Embodied Reference with Touch-Line Transformer
Yongqian Li
Xiaoxue Chen
Hao Zhao
Jiangtao Gong
Guyue Zhou
Federico Rossano
Yixin Zhu
177
17
0
11 Oct 2022
MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model
MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model
Yatai Ji
Junjie Wang
Yuan Gong
Lin Zhang
Yan Zhu
Hongfa Wang
Jiaxing Zhang
Tetsuya Sakai
Yujiu Yang
MLLM
82
33
0
11 Oct 2022
Generating Executable Action Plans with Environmentally-Aware Language
  Models
Generating Executable Action Plans with Environmentally-Aware Language Models
Maitrey Gramopadhye
D. Szafir
LM&RoLLMAG
104
24
0
10 Oct 2022
Transformer-based Localization from Embodied Dialog with Large-scale
  Pre-training
Transformer-based Localization from Embodied Dialog with Large-scale Pre-training
Meera Hahn
James M. Rehg
LM&Ro
104
4
0
10 Oct 2022
VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature
  Alignment
VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment
Shraman Pramanick
Li Jing
Sayan Nag
Jiachen Zhu
Hardik Shah
Yann LeCun
Ramalingam Chellappa
90
22
0
09 Oct 2022
Learning Fine-Grained Visual Understanding for Video Question Answering
  via Decoupling Spatial-Temporal Modeling
Learning Fine-Grained Visual Understanding for Video Question Answering via Decoupling Spatial-Temporal Modeling
Hsin-Ying Lee
Hung-Ting Su
Bing-Chen Tsai
Tsung-Han Wu
Jia-Fong Yeh
Winston H. Hsu
105
2
0
08 Oct 2022
Visualize Before You Write: Imagination-Guided Open-Ended Text
  Generation
Visualize Before You Write: Imagination-Guided Open-Ended Text Generation
Wanrong Zhu
An Yan
Yujie Lu
Wenda Xu
Xinze Wang
Miguel P. Eckstein
William Yang Wang
137
36
0
07 Oct 2022
VLSNR:Vision-Linguistics Coordination Time Sequence-aware News
  Recommendation
VLSNR:Vision-Linguistics Coordination Time Sequence-aware News Recommendation
Songhao Han
Wei-Ping Huang
Xiaotian Luan Beihang University
AI4TS
81
3
0
06 Oct 2022
MuRAG: Multimodal Retrieval-Augmented Generator for Open Question
  Answering over Images and Text
MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text
Wenhu Chen
Hexiang Hu
Xi Chen
Pat Verga
William W. Cohen
RALM
102
160
0
06 Oct 2022
Vision+X: A Survey on Multimodal Learning in the Light of Data
Vision+X: A Survey on Multimodal Learning in the Light of Data
Ye Zhu
Yuehua Wu
N. Sebe
Yan Yan
119
19
0
05 Oct 2022
Affection: Learning Affective Explanations for Real-World Visual Data
Affection: Learning Affective Explanations for Real-World Visual Data
Panos Achlioptas
M. Ovsjanikov
Leonidas Guibas
Sergey Tulyakov
109
12
0
04 Oct 2022
Understanding Prior Bias and Choice Paralysis in Transformer-based
  Language Representation Models through Four Experimental Probes
Understanding Prior Bias and Choice Paralysis in Transformer-based Language Representation Models through Four Experimental Probes
Ke Shen
Mayank Kejriwal
89
4
0
03 Oct 2022
Enhancing Interpretability and Interactivity in Robot Manipulation: A
  Neurosymbolic Approach
Enhancing Interpretability and Interactivity in Robot Manipulation: A Neurosymbolic Approach
Georgios Tziafas
Hamidreza Kasaei
LM&Ro
101
3
0
03 Oct 2022
Multimodal Analogical Reasoning over Knowledge Graphs
Multimodal Analogical Reasoning over Knowledge Graphs
Ningyu Zhang
Lei Li
Xiang Chen
Xiaozhuan Liang
Shumin Deng
Huajun Chen
149
28
0
01 Oct 2022
Construction and Evaluation of a Self-Attention Model for Semantic
  Understanding of Sentence-Final Particles
Construction and Evaluation of a Self-Attention Model for Semantic Understanding of Sentence-Final Particles
Shuhei Mandokoro
N. Oka
Akane Matsushima
Chie Fukada
Yuko Yoshimura
Koji Kawahara
Kazuaki Tanaka
54
1
0
01 Oct 2022
A Dual-Attention Learning Network with Word and Sentence Embedding for
  Medical Visual Question Answering
A Dual-Attention Learning Network with Word and Sentence Embedding for Medical Visual Question Answering
Xiaofei Huang
Hongfang Gong
MedIm
111
14
0
01 Oct 2022
Previous
123...212223...414243
Next