ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 1908.02265
  4. Cited By
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for
  Vision-and-Language Tasks

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

6 August 2019
Jiasen Lu
Dhruv Batra
Devi Parikh
Stefan Lee
    SSLVLM
ArXiv (abs)PDFHTML

Papers citing "ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks"

50 / 2,119 papers shown
Title
Text-to-Image Generation via Implicit Visual Guidance and Hypernetwork
Text-to-Image Generation via Implicit Visual Guidance and Hypernetwork
Xin Yuan
Zhe Lin
Jason Kuen
Jianming Zhang
John Collomosse
96
5
0
17 Aug 2022
Understanding Attention for Vision-and-Language Tasks
Understanding Attention for Vision-and-Language Tasks
Feiqi Cao
S. Han
Siqu Long
Changwei Xu
Josiah Poon
84
5
0
17 Aug 2022
What Artificial Neural Networks Can Tell Us About Human Language
  Acquisition
What Artificial Neural Networks Can Tell Us About Human Language Acquisition
Alex Warstadt
Samuel R. Bowman
94
120
0
17 Aug 2022
PPMN: Pixel-Phrase Matching Network for One-Stage Panoptic Narrative
  Grounding
PPMN: Pixel-Phrase Matching Network for One-Stage Panoptic Narrative Grounding
Zihan Ding
Zixiang Ding
Tianrui Hui
Junshi Huang
Xiaoming Wei
Xiaolin K. Wei
Si Liu
96
14
0
11 Aug 2022
ARMANI: Part-level Garment-Text Alignment for Unified Cross-Modal
  Fashion Design
ARMANI: Part-level Garment-Text Alignment for Unified Cross-Modal Fashion Design
Xujie Zhang
Yuyang Sha
Michael C. Kampffmeyer
Zhenyu Xie
Zequn Jie
Chengwen Huang
Jianqing Peng
Xiaodan Liang
97
20
0
11 Aug 2022
Self-supervised Multi-modal Training from Uncurated Image and Reports
  Enables Zero-shot Oversight Artificial Intelligence in Radiology
Self-supervised Multi-modal Training from Uncurated Image and Reports Enables Zero-shot Oversight Artificial Intelligence in Radiology
Sangjoon Park
Eunha Lee
Kyung Sook Shin
Jeonghyeon Lee
Jong Chul Ye
61
2
0
10 Aug 2022
Aesthetic Attributes Assessment of Images with AMANv2 and DPC-CaptionsV2
Aesthetic Attributes Assessment of Images with AMANv2 and DPC-CaptionsV2
Xinghui Zhou
Xin Jin
Jianwen Lv
Heng Huang
Ming Mao
Shuai Cui
CoGe
58
0
0
09 Aug 2022
GRIT-VLP: Grouped Mini-batch Sampling for Efficient Vision and Language
  Pre-training
GRIT-VLP: Grouped Mini-batch Sampling for Efficient Vision and Language Pre-training
Jaeseok Byun
Taebaek Hwang
Jianlong Fu
Taesup Moon
VLM
95
11
0
08 Aug 2022
LATTE: LAnguage Trajectory TransformEr
LATTE: LAnguage Trajectory TransformEr
A. Bucker
Luis F. C. Figueredo
Sami Haddadin
Ashish Kapoor
Shuang Ma
Sai H. Vemprala
Rogerio Bonatti
LM&Ro
146
59
0
04 Aug 2022
Prompt Tuning for Generative Multimodal Pretrained Models
Prompt Tuning for Generative Multimodal Pretrained Models
Han Yang
Junyang Lin
An Yang
Peng Wang
Chang Zhou
Hongxia Yang
VLMLRMVPVLM
91
31
0
04 Aug 2022
Fine-Grained Semantically Aligned Vision-Language Pre-Training
Fine-Grained Semantically Aligned Vision-Language Pre-Training
Juncheng Li
Xin He
Longhui Wei
Long Qian
Linchao Zhu
Lingxi Xie
Yueting Zhuang
Qi Tian
Siliang Tang
VLM
106
80
0
04 Aug 2022
Masked Vision and Language Modeling for Multi-modal Representation
  Learning
Masked Vision and Language Modeling for Multi-modal Representation Learning
Gukyeong Kwon
Zhaowei Cai
Avinash Ravichandran
Erhan Bas
Rahul Bhotika
Stefano Soatto
92
68
0
03 Aug 2022
Two-Stream Transformer Architecture for Long Video Understanding
Two-Stream Transformer Architecture for Long Video Understanding
Edward Fish
Jon Weinbren
Andrew Gilbert
ViT
52
6
0
02 Aug 2022
Video Question Answering with Iterative Video-Text Co-Tokenization
Video Question Answering with Iterative Video-Text Co-Tokenization
A. Piergiovanni
K. Morton
Weicheng Kuo
Michael S. Ryoo
A. Angelova
104
18
0
01 Aug 2022
Generative Bias for Robust Visual Question Answering
Generative Bias for Robust Visual Question Answering
Jae-Won Cho
Dong-Jin Kim
H. Ryu
In So Kweon
OODCML
109
20
0
01 Aug 2022
Augmenting Vision Language Pretraining by Learning Codebook with Visual
  Semantics
Augmenting Vision Language Pretraining by Learning Codebook with Visual Semantics
Xiaoyuan Guo
Jiali Duan
C.-C. Jay Kuo
J. Gichoya
Imon Banerjee
VLM
53
1
0
31 Jul 2022
ALADIN: Distilling Fine-grained Alignment Scores for Efficient
  Image-Text Matching and Retrieval
ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval
Nicola Messina
Matteo Stefanini
Marcella Cornia
Lorenzo Baraldi
Fabrizio Falchi
Giuseppe Amato
Rita Cucchiara
VLM
42
22
0
29 Jul 2022
DoRO: Disambiguation of referred object for embodied agents
DoRO: Disambiguation of referred object for embodied agents
Pradip Pramanick
Chayan Sarkar
S. Paul
R. Roychoudhury
Brojeshwar Bhowmick
LM&Ro
51
14
0
28 Jul 2022
SiRi: A Simple Selective Retraining Mechanism for Transformer-based
  Visual Grounding
SiRi: A Simple Selective Retraining Mechanism for Transformer-based Visual Grounding
Mengxue Qu
Yu Wu
Wu Liu
Qiqi Gong
Xiaodan Liang
Olga Russakovsky
Yao Zhao
Yunchao Wei
ObjD
50
24
0
27 Jul 2022
Uncertainty-based Visual Question Answering: Estimating Semantic
  Inconsistency between Image and Knowledge Base
Uncertainty-based Visual Question Answering: Estimating Semantic Inconsistency between Image and Knowledge Base
Jinyeong Chae
Jihie Kim
64
2
0
27 Jul 2022
LaKo: Knowledge-driven Visual Question Answering via Late
  Knowledge-to-Text Injection
LaKo: Knowledge-driven Visual Question Answering via Late Knowledge-to-Text Injection
Zhuo Chen
Yufen Huang
Jiaoyan Chen
Yuxia Geng
Yin Fang
Jeff Z. Pan
Ningyu Zhang
Wen Zhang
95
38
0
26 Jul 2022
Learning Visual Representation from Modality-Shared Contrastive
  Language-Image Pre-training
Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training
Haoxuan You
Luowei Zhou
Bin Xiao
Noel Codella
Yu Cheng
Ruochen Xu
Shih-Fu Chang
Lu Yuan
CLIPVLM
89
48
0
26 Jul 2022
Multi-Attention Network for Compressed Video Referring Object
  Segmentation
Multi-Attention Network for Compressed Video Referring Object Segmentation
Weidong Chen
Dexiang Hong
Yuankai Qi
Zhenjun Han
Shuhui Wang
Laiyun Qing
Qingming Huang
Guorong Li
VOS
58
40
0
26 Jul 2022
Generalizable Patch-Based Neural Rendering
Generalizable Patch-Based Neural Rendering
M. Suhail
Carlos Esteves
Leonid Sigal
A. Makadia
158
106
0
21 Jul 2022
Semantic-aware Modular Capsule Routing for Visual Question Answering
Semantic-aware Modular Capsule Routing for Visual Question Answering
Yudong Han
Jianhua Yin
Jianlong Wu
Yin-wei Wei
Liqiang Nie
68
8
0
21 Jul 2022
LocVTP: Video-Text Pre-training for Temporal Localization
LocVTP: Video-Text Pre-training for Temporal Localization
Meng Cao
Tianyu Yang
Junwu Weng
Can Zhang
Jue Wang
Yuexian Zou
103
65
0
21 Jul 2022
Temporal and cross-modal attention for audio-visual zero-shot learning
Temporal and cross-modal attention for audio-visual zero-shot learning
Otniel-Bogdan Mercea
Thomas Hummel
A. Sophia Koepke
Zeynep Akata
102
27
0
20 Jul 2022
Explicit Image Caption Editing
Explicit Image Caption Editing
Zhen Wang
Long Chen
Wenbo Ma
G. Han
Yulei Niu
Jian Shao
Jun Xiao
72
12
0
20 Jul 2022
Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification
Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification
Renrui Zhang
Zhang Wei
Rongyao Fang
Peng Gao
Kunchang Li
Jifeng Dai
Yu Qiao
Hongsheng Li
VLM
145
321
0
19 Jul 2022
Don't Stop Learning: Towards Continual Learning for the CLIP Model
Don't Stop Learning: Towards Continual Learning for the CLIP Model
Yuxuan Ding
Lingqiao Liu
Chunna Tian
Jingyuan Yang
Haoxuan Ding
CLLVLMKELM
82
55
0
19 Jul 2022
Target-Driven Structured Transformer Planner for Vision-Language
  Navigation
Target-Driven Structured Transformer Planner for Vision-Language Navigation
Yusheng Zhao
Jinyu Chen
Chen Gao
Wenguan Wang
Lirong Yang
Haibing Ren
Huaxia Xia
Si Liu
LM&Ro
103
60
0
19 Jul 2022
Exploiting Unlabeled Data with Vision and Language Models for Object
  Detection
Exploiting Unlabeled Data with Vision and Language Models for Object Detection
Shiyu Zhao
Zhixing Zhang
S. Schulter
Long Zhao
Vijay Kumar B.G
Anastasis Stathopoulos
Manmohan Chandraker
Dimitris N. Metaxas
VLMObjD
89
102
0
18 Jul 2022
Zero-Shot Temporal Action Detection via Vision-Language Prompting
Zero-Shot Temporal Action Detection via Vision-Language Prompting
Sauradip Nag
Xiatian Zhu
Yi-Zhe Song
Tao Xiang
VLM
84
68
0
17 Jul 2022
FashionViL: Fashion-Focused Vision-and-Language Representation Learning
FashionViL: Fashion-Focused Vision-and-Language Representation Learning
Xiaoping Han
Licheng Yu
Xiatian Zhu
Li Zhang
Yi-Zhe Song
Tao Xiang
AI4TS
54
49
0
17 Jul 2022
Learning Granularity-Unified Representations for Text-to-Image Person
  Re-identification
Learning Granularity-Unified Representations for Text-to-Image Person Re-identification
Zhiyin Shao
Xinyu Zhang
Meng Fang
Zhi-hao Lin
Jian Wang
Changxing Ding
78
110
0
16 Jul 2022
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text
  Retrieval
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval
Yiwei Ma
Guohai Xu
Xiaoshuai Sun
Ming Yan
Ji Zhang
Rongrong Ji
CLIPVLM
111
293
0
15 Jul 2022
TRIE++: Towards End-to-End Information Extraction from Visually Rich
  Documents
TRIE++: Towards End-to-End Information Extraction from Visually Rich Documents
Zhanzhan Cheng
Peng Zhang
Can Li
Qiao Liang
Yunlu Xu
Pengfei Li
Shiliang Pu
Yi Niu
Fei Wu
52
10
0
14 Jul 2022
Global-local Motion Transformer for Unsupervised Skeleton-based Action
  Learning
Global-local Motion Transformer for Unsupervised Skeleton-based Action Learning
Boeun Kim
H. Chang
Jungho Kim
J. Choi
ViT
87
52
0
13 Jul 2022
Learning to Estimate External Forces of Human Motion in Video
Learning to Estimate External Forces of Human Motion in Video
Nathan Louis
Tylan N. Templin
Travis D. Eliason
D. Nicolella
Jason J. Corso
3DH
130
5
0
12 Jul 2022
Inner Monologue: Embodied Reasoning through Planning with Language
  Models
Inner Monologue: Embodied Reasoning through Planning with Language Models
Wenlong Huang
F. Xia
Ted Xiao
Harris Chan
Jacky Liang
...
Tomas Jackson
Linda Luu
Sergey Levine
Karol Hausman
Brian Ichter
LLMAGLM&RoLRM
214
927
0
12 Jul 2022
Video Graph Transformer for Video Question Answering
Video Graph Transformer for Video Question Answering
Junbin Xiao
Pan Zhou
Tat-Seng Chua
Shuicheng Yan
ViT
242
78
0
12 Jul 2022
IDEA: Increasing Text Diversity via Online Multi-Label Recognition for
  Vision-Language Pre-training
IDEA: Increasing Text Diversity via Online Multi-Label Recognition for Vision-Language Pre-training
Xinyu Huang
Youcai Zhang
Ying Cheng
Weiwei Tian
Ruiwei Zhao
Rui Feng
Yuejie Zhang
Yaqian Li
Yandong Guo
Xiao-Yong Zhang
VLM
86
14
0
12 Jul 2022
LaT: Latent Translation with Cycle-Consistency for Video-Text Retrieval
LaT: Latent Translation with Cycle-Consistency for Video-Text Retrieval
Jinbin Bai
Chunhui Liu
Feiyue Ni
Haofan Wang
Mengying Hu
Xiaofeng Guo
Lele Cheng
107
11
0
11 Jul 2022
Multi-level Fusion of Wav2vec 2.0 and BERT for Multimodal Emotion
  Recognition
Multi-level Fusion of Wav2vec 2.0 and BERT for Multimodal Emotion Recognition
Zihan Zhao
Yanfeng Wang
Yu Wang
57
35
0
11 Jul 2022
Towards Multimodal Vision-Language Models Generating Non-Generic Text
Towards Multimodal Vision-Language Models Generating Non-Generic Text
Wes Robbins
Zanyar Zohourianshahzadi
Jugal Kalita
56
1
0
09 Jul 2022
Open-Vocabulary Multi-Label Classification via Multi-Modal Knowledge
  Transfer
Open-Vocabulary Multi-Label Classification via Multi-Modal Knowledge Transfer
Su He
Taian Guo
Tao Dai
Ruizhi Qiao
Bo Ren
Shutao Xia
VLM
154
51
0
05 Jul 2022
Toward Explainable and Fine-Grained 3D Grounding through Referring
  Textual Phrases
Toward Explainable and Fine-Grained 3D Grounding through Referring Textual Phrases
Zhihao Yuan
Xu Yan
Zhuo Li
Xuhao Li
Yao Guo
Shuguang Cui
Zhen Li
92
17
0
05 Jul 2022
Vision-and-Language Pretraining
Vision-and-Language Pretraining
Thong Nguyen
Cong-Duy Nguyen
Xiaobao Wu
See-Kiong Ng
Anh Tuan Luu
VLMCLIP
74
2
0
05 Jul 2022
Dynamic Contrastive Distillation for Image-Text Retrieval
Dynamic Contrastive Distillation for Image-Text Retrieval
Jun Rao
Liang Ding
Shuhan Qi
Meng Fang
Yang Liu
Liqiong Shen
Dacheng Tao
VLM
114
33
0
04 Jul 2022
DUET: Cross-modal Semantic Grounding for Contrastive Zero-shot Learning
DUET: Cross-modal Semantic Grounding for Contrastive Zero-shot Learning
Zhuo Chen
Yufen Huang
Jiaoyan Chen
Yuxia Geng
Wen Zhang
Yin Fang
Jeff Z. Pan
Huajun Chen
VLM
137
66
0
04 Jul 2022
Previous
123...232425...414243
Next