ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

6 August 2019
Jiasen Lu
Dhruv Batra
Devi Parikh
Stefan Lee
    SSLVLM

Papers citing "ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks"

50 / 2,119 papers shown
Cross-Attention is Not Enough: Incongruity-Aware Dynamic Hierarchical Fusion for Multimodal Affect Recognition
Yaoting Wang
Yuanchao Li
Paul Pu Liang
Louis-Philippe Morency
P. Bell
Catherine Lai
CVBM
85
8
0
23 May 2023
Preconditioned Visual Language Inference with Weak Supervision
Ehsan Qasemi
Amani Maina-Kilaas
Devadutta Dash
Khalid Alsaggaf
Muhao Chen
87
0
0
22 May 2023
GNCformer Enhanced Self-attention for Automatic Speech Recognition
Junlong Li
Z. Duan
S. Li
X. Yu
G. Yang
53
1
0
22 May 2023
Has It All Been Solved? Open NLP Research Questions Not Solved by Large Language Models
Oana Ignat
Zhijing Jin
Artem Abzaliev
Laura Biester
Santiago Castro
...
Verónica Pérez-Rosas
Siqi Shen
Zekun Wang
Winston Wu
Rada Mihalcea
LRM
143
6
0
21 May 2023
Brain encoding models based on multimodal transformers can transfer across language and vision
Jerry Tang
Meng Du
Vy A. Vo
Vasudev Lal
Alexander G. Huth
98
33
0
20 May 2023
Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner
Zikang Liu
Sihan Chen
Longteng Guo
Handong Li
Xingjian He
Qingbin Liu
82
1
0
19 May 2023
Generating Visual Spatial Description via Holistic 3D Scene Understanding
Yu Zhao
Hao Fei
Wei Ji
Jianguo Wei
Meishan Zhang
Hao Fei
Tat-Seng Chua
65
33
0
19 May 2023
Information Screening whilst Exploiting! Multimodal Relation Extraction with Feature Denoising and Multimodal Topic Modeling
Shengqiong Wu
Hao Fei
Yixin Cao
Lidong Bing
Tat-Seng Chua
93
35
0
19 May 2023
Speech-Text Dialog Pre-training for Spoken Dialog Understanding with Explicit Cross-Modal Alignment
Tianshu Yu
Haoyu Gao
Ting-En Lin
Min Yang
Yuchuan Wu
Wen-Cheng Ma
Chao Wang
Fei Huang
Yongbin Li
68
23
0
19 May 2023
TreePrompt: Learning to Compose Tree Prompts for Explainable Visual Grounding
Chenchi Zhang
Jun Xiao
Lei Chen
Jian Shao
Long Chen
VLMLRM
87
2
0
19 May 2023
MALM: Mask Augmentation based Local Matching for Food-Recipe Retrieval
Bhanu Prakash Voutharoja
Peng Wang
Lei Wang
Vivienne Guan
71
6
0
18 May 2023
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
Peng Wang
Shijie Wang
Junyang Lin
Shuai Bai
Xiaohuan Zhou
Jingren Zhou
Xinggang Wang
Chang Zhou
VLMMLLMObjD
151
122
0
18 May 2023
XFormer: Fast and Accurate Monocular 3D Body Capture
Lihui Qian
Xintong Han
Faqiang Wang
Hongyu Liu
Haoye Dong
Zhiwen Li
Huawei Wei
Zhe Lin
Cheng-Bin Jin
3DH
76
1
0
18 May 2023
Inspecting the Geographical Representativeness of Images from Text-to-Image Models
Aparna Basu
R. Venkatesh Babu
Danish Pruthi
DiffM
120
40
0
18 May 2023
Probing the Role of Positional Information in Vision-Language Models
Philipp J. Rösch
Jindrich Libovický
65
8
0
17 May 2023
Mobile User Interface Element Detection Via Adaptively Prompt Tuning
Zhangxuan Gu
Zhuoer Xu
Haoxing Chen
Jun Lan
Changhua Meng
Weiqiang Wang
54
4
0
16 May 2023
Bridging the Domain Gap: Self-Supervised 3D Scene Understanding with Foundation Models
Zhimin Chen
Longlong Jing
Yingwei Li
Bing Li
118
34
0
15 May 2023
Continual Multimodal Knowledge Graph Construction
Xiang Chen
Jintian Zhang
Xiaohan Wang
Ningyu Zhang
Tongtong Wu
Luo Si
Yongheng Wang
Huajun Chen
KELMCLL
95
15
0
15 May 2023
Semantic Composition in Visually Grounded Language Models
Rohan Pandey
CoGe
91
1
0
15 May 2023
RC3: Regularized Contrastive Cross-lingual Cross-modal Pre-training
Chulun Zhou
Yunlong Liang
Fandong Meng
Jinan Xu
Jinsong Su
Jie Zhou
VLM
71
4
0
13 May 2023
Learning the Visualness of Text Using Large Vision-Language Models
Gaurav Verma
Ryan Rossi
Chris Tensmeyer
Jiuxiang Gu
A. Nenkova
VLM
71
0
0
11 May 2023
Image-to-Text Translation for Interactive Image Recognition: A Comparative User Study with Non-Expert Users
Wataru Kawabe
Yusuke Sugano
VLM
70
2
0
11 May 2023
Combo of Thinking and Observing for Outside-Knowledge VQA
Q. Si
Yuchen Mo
Zheng Lin
Huishan Ji
Weiping Wang
95
14
0
10 May 2023
InfoMetIC: An Informative Metric for Reference-free Image Caption Evaluation
Anwen Hu
Shizhe Chen
Liang Zhang
Qin Jin
63
22
0
10 May 2023
Vision-Language Models in Remote Sensing: Current Progress and Future Trends
Xiang Li
Congcong Wen
Yuan Hu
Zhenghang Yuan
Xiao Xiang Zhu
VLM
89
82
0
09 May 2023
A Review of Vision-Language Models and their Performance on the Hateful Memes Challenge
Bryan Zhao
Andrew Zhang
Blake Watson
Gillian Kearney
Isaac Dale
VLM
38
4
0
09 May 2023
A Multi-Modal Context Reasoning Approach for Conditional Inference on Joint Textual and Visual Clues
Yunxin Li
Baotian Hu
Xinyu Chen
Yuxin Ding
Lin Ma
Min Zhang
LRM
93
15
0
08 May 2023
Scene Text Recognition with Image-Text Matching-guided Dictionary
Jiajun Wei
Hongjian Zhan
X. Tu
Yue Lu
Umapada Pal
VLM
48
0
0
08 May 2023
IIITD-20K: Dense captioning for Text-Image ReID
A. V. Subramanyam
N. Sundararajan
Vibhu Dubey
Brejesh Lall
VLM
30
3
0
08 May 2023
Vision Language Pre-training by Contrastive Learning with Cross-Modal Similarity Regulation
Chaoya Jiang
Wei Ye
Haiyang Xu
Miang yan
Shikun Zhang
Jie Zhang
Fei Huang
VLM
92
16
0
08 May 2023
OpenViVQA: Task, Dataset, and Multimodal Fusion Models for Visual Question Answering in Vietnamese
Nghia Hieu Nguyen
Duong T.D. Vo
Kiet Van Nguyen
Ngan Luu-Thuy Nguyen
84
20
0
07 May 2023
X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages
Feilong Chen
Minglun Han
Haozhi Zhao
Qingyang Zhang
Jing Shi
Shuang Xu
Bo Xu
MLLM
159
126
0
07 May 2023
Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-modal Structured Representations
Yufen Huang
Jiji Tang
Zhuo Chen
Rongsheng Zhang
Xinfeng Zhang
...
Zeng Zhao
Zhou Zhao
Tangjie Lv
Zhipeng Hu
Wen Zhang
VLM
125
25
0
06 May 2023
Personalize Segment Anything Model with One Shot
Renrui Zhang
Zhengkai Jiang
Ziyu Guo
Shilin Yan
Junting Pan
Xianzheng Ma
Hao Dong
Peng Gao
Hongsheng Li
MLLMVLM
124
219
0
04 May 2023
ANetQA: A Large-scale Benchmark for Fine-grained Compositional Reasoning over Untrimmed Videos
Zhou Yu
Lixiang Zheng
Zhou Zhao
A. Fedoseev
Jianping Fan
Kui Ren
Jun Yu
CoGe
125
16
0
04 May 2023
Making the Most of What You Have: Adapting Pre-trained Visual Language Models in the Low-data Regime
Chuhan Zhang
Antoine Miech
Jiajun Shen
Jean-Baptiste Alayrac
Pauline Luc
VLMVPVLM
90
2
0
03 May 2023
A Neural Divide-and-Conquer Reasoning Framework for Image Retrieval from Linguistically Complex Text
Yunxin Li
Baotian Hu
Yuxin Ding
Lin Ma
Hao Fei
88
5
0
03 May 2023
VPGTrans: Transfer Visual Prompt Generator across LLMs
Ao Zhang
Hao Fei
Yuan Yao
Wei Ji
Li Li
Zhiyuan Liu
Tat-Seng Chua
MLLMVLM
92
89
0
02 May 2023
In-Context Learning Unlocked for Diffusion Models
Zhendong Wang
Yi Ding
Yadong Lu
Yelong Shen
Pengcheng He
Weizhu Chen
Zhangyang Wang
Mingyuan Zhou
VLMDiffM
150
78
0
01 May 2023
ArK: Augmented Reality with Knowledge Interactive Emergent Ability
Qiuyuan Huang
Jinho Park
Abhinav Gupta
Paul N. Bennett
Ran Gong
...
Baolin Peng
O. Mohammed
C. Pal
Yejin Choi
Jianfeng Gao
122
6
0
01 May 2023
Multimodal Graph Transformer for Multimodal Question Answering
Xuehai He
Xin Eric Wang
103
9
0
30 Apr 2023
Click-Feedback Retrieval
Zeyu Wang
Yuehua Wu
59
0
0
28 Apr 2023
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
Peng Gao
Jiaming Han
Renrui Zhang
Ziyi Lin
Shijie Geng
...
Pan Lu
Conghui He
Xiangyu Yue
Hongsheng Li
Yu Qiao
MLLM
129
589
0
28 Apr 2023
An Empirical Study of Multimodal Model Merging
Yi-Lin Sung
Linjie Li
Kevin Qinghong Lin
Zhe Gan
Joey Tianyi Zhou
Lijuan Wang
MoMe
125
42
0
28 Apr 2023
Multimodal Grounding for Embodied AI via Augmented Reality Headsets for Natural Language Driven Task Planning
Selma Wanna
Fabian Parra
R. Valner
Karl Kruusamäe
Mitch Pryor
LM&Ro
77
2
0
26 Apr 2023
Towards Multi-Modal DBMSs for Seamless Querying of Texts and Tables
Matthias Urban
Carsten Binnig
81
5
0
26 Apr 2023
From Association to Generation: Text-only Captioning by Unsupervised Cross-modal Mapping
Junyan Wang
Ming Yan
Yi Zhang
Jitao Sang
CLIPVLM
74
9
0
26 Apr 2023
Sample-Specific Debiasing for Better Image-Text Models
Peiqi Wang
Yingcheng Liu
Ching-Yun Ko
W. Wells
Seth Berkowitz
Steven Horng
Polina Golland
SSLMedIm
121
1
0
25 Apr 2023
Hypernymization of named entity-rich captions for grounding-based multi-modal pretraining
Giacomo Nebbia
Adriana Kovashka
110
0
0
25 Apr 2023
Rethinking Benchmarks for Cross-modal Image-text Retrieval
Wei Chen
Linli Yao
Qin Jin
VLM
109
17
0
21 Apr 2023