ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 1908.02265
  4. Cited By
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for
  Vision-and-Language Tasks

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

6 August 2019
Jiasen Lu
Dhruv Batra
Devi Parikh
Stefan Lee
    SSLVLM
ArXiv (abs)PDFHTML

Papers citing "ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks"

50 / 2,118 papers shown
Title
CluMo: Cluster-based Modality Fusion Prompt for Continual Learning in
  Visual Question Answering
CluMo: Cluster-based Modality Fusion Prompt for Continual Learning in Visual Question Answering
Yuliang Cai
Mohammad Rostami
CLLVLMMLLM
128
4
0
21 Aug 2024
C${^2}$RL: Content and Context Representation Learning for Gloss-free
  Sign Language Translation and Retrieval
C2{^2}2RL: Content and Context Representation Learning for Gloss-free Sign Language Translation and Retrieval
Zhigang Chen
Benjia Zhou
Yiqing Huang
Jun Wan
Yibo Hu
Hailin Shi
Yanyan Liang
Zhen Lei
Du Zhang
VLMSLR
70
3
0
19 Aug 2024
Attribution Analysis Meets Model Editing: Advancing Knowledge Correction in Vision Language Models with VisEdit
Attribution Analysis Meets Model Editing: Advancing Knowledge Correction in Vision Language Models with VisEdit
Qizhou Chen
Taolin Zhang
Chengyu Wang
Xiaofeng He
Dakan Wang
Tingting Liu
KELM
177
4
0
19 Aug 2024
Ask, Attend, Attack: A Effective Decision-Based Black-Box Targeted
  Attack for Image-to-Text Models
Ask, Attend, Attack: A Effective Decision-Based Black-Box Targeted Attack for Image-to-Text Models
Qingyuan Zeng
Zhenzhong Wang
Yiu-ming Cheung
Min Jiang
AAML
88
2
0
16 Aug 2024
Multi-Modal Dialogue State Tracking for Playing GuessWhich Game
Multi-Modal Dialogue State Tracking for Playing GuessWhich Game
Wei Pang
Ruixue Duan
Jinfu Yang
Ning Li
67
0
0
15 Aug 2024
Adaptive Learning of Consistency and Inconsistency Information for Fake
  News Detection
Adaptive Learning of Consistency and Inconsistency Information for Fake News Detection
Aohan Li
Jiaxin Chen
Xin Liao
Dengyong Zhang
56
1
0
15 Aug 2024
IIU: Independent Inference Units for Knowledge-based Visual Question
  Answering
IIU: Independent Inference Units for Knowledge-based Visual Question Answering
Yili Li
Jing Yu
Keke Gai
Gang Xiong
58
0
0
15 Aug 2024
Modality Invariant Multimodal Learning to Handle Missing Modalities: A
  Single-Branch Approach
Modality Invariant Multimodal Learning to Handle Missing Modalities: A Single-Branch Approach
Muhammad Saad Saeed
Shah Nawaz
Muhammad Zaigham Zaheer
Muhammad Haris Khan
Karthik Nandakumar
Muhammad Haroon Yousaf
Hassan Sajjad
Tom De Schepper
Markus Schedl
95
0
0
14 Aug 2024
Dual-Domain CLIP-Assisted Residual Optimization Perception Model for
  Metal Artifact Reduction
Dual-Domain CLIP-Assisted Residual Optimization Perception Model for Metal Artifact Reduction
Xinrui Zhang
Ailong Cai
Shaoyu Wang
Linyuan Wang
Zhizhong Zheng
Lei Li
Bin Yan
MedIm
78
0
0
14 Aug 2024
LLMI3D: MLLM-based 3D Perception from a Single 2D Image
LLMI3D: MLLM-based 3D Perception from a Single 2D Image
Fan Yang
Sicheng Zhao
Yanhao Zhang
Haoxiang Chen
Hui Chen
Wenbo Tang
Guiguang Ding
89
3
0
14 Aug 2024
ASR-enhanced Multimodal Representation Learning for Cross-Domain Product Retrieval
ASR-enhanced Multimodal Representation Learning for Cross-Domain Product Retrieval
Ruixiang Zhao
Jian Jia
Yan Li
Xuehan Bai
Quan Chen
Han Li
Peng Jiang
Xirong Li
80
0
0
06 Aug 2024
Towards Coarse-grained Visual Language Navigation Task Planning Enhanced
  by Event Knowledge Graph
Towards Coarse-grained Visual Language Navigation Task Planning Enhanced by Event Knowledge Graph
Zhao Kaichen
Song Yaoxian
Zhao Haiquan
Liu Haoyu
Li Tiefeng
Li Zhixu
81
0
0
05 Aug 2024
Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining
Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining
Dongyang Liu
Shitian Zhao
Le Zhuo
Weifeng Lin
Ping Luo
Xinyue Li
Qi Qin
Yu Qiao
Hongsheng Li
Peng Gao
MLLM
168
59
0
05 Aug 2024
A Novel Evaluation Framework for Image2Text Generation
A Novel Evaluation Framework for Image2Text Generation
Jia-Hong Huang
Hongyi Zhu
Yixian Shen
Stevan Rudinac
A. M. Pacces
Evangelos Kanoulas
77
9
0
03 Aug 2024
Actra: Optimized Transformer Architecture for Vision-Language-Action
  Models in Robot Learning
Actra: Optimized Transformer Architecture for Vision-Language-Action Models in Robot Learning
Yueen Ma
Dafeng Chi
Shiguang Wu
Yuecheng Liu
Yuzheng Zhuang
Jianye Hao
Irwin King
69
5
0
02 Aug 2024
An Efficient and Effective Transformer Decoder-Based Framework for
  Multi-Task Visual Grounding
An Efficient and Effective Transformer Decoder-Based Framework for Multi-Task Visual Grounding
Wei Chen
Mahdieh Hatamian
Yu Wu
102
5
0
02 Aug 2024
MarvelOVD: Marrying Object Recognition and Vision-Language Models for
  Robust Open-Vocabulary Object Detection
MarvelOVD: Marrying Object Recognition and Vision-Language Models for Robust Open-Vocabulary Object Detection
Kuo Wang
Lechao Cheng
Weikai Chen
Pingping Zhang
Liang Lin
Fan Zhou
Guanbin Li
VLMObjD
76
3
0
31 Jul 2024
Advancing Vietnamese Visual Question Answering with Transformer and
  Convolutional Integration
Advancing Vietnamese Visual Question Answering with Transformer and Convolutional Integration
Ngoc Son Nguyen
Van Nguyen
Tung Le
ViT
91
1
0
30 Jul 2024
BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger
  Visual Cues
BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues
Sara Sarto
Marcella Cornia
Lorenzo Baraldi
Rita Cucchiara
80
7
0
29 Jul 2024
FlexAttention for Efficient High-Resolution Vision-Language Models
FlexAttention for Efficient High-Resolution Vision-Language Models
Junyan Li
Delin Chen
Tianle Cai
Peihao Chen
Yining Hong
Zhenfang Chen
Yikang Shen
Chuang Gan
VLM
125
5
0
29 Jul 2024
Multi-modal Crowd Counting via Modal Emulation
Multi-modal Crowd Counting via Modal Emulation
Chenhao Wang
Xiaopeng Hong
Zhiheng Ma
Yupeng Wei
Yabin Wang
Xiaopeng Fan
58
1
0
28 Jul 2024
MMCLIP: Cross-modal Attention Masked Modelling for Medical Language-Image Pre-Training
MMCLIP: Cross-modal Attention Masked Modelling for Medical Language-Image Pre-Training
Biao Wu
Yutong Xie
Zeyu Zhang
Minh Hieu Phan
Qi Chen
Ling-Hao Chen
Qi Wu
LM&MA
112
0
0
28 Jul 2024
FakingRecipe: Detecting Fake News on Short Video Platforms from the
  Perspective of Creative Process
FakingRecipe: Detecting Fake News on Short Video Platforms from the Perspective of Creative Process
Yuyan Bu
Qiang Sheng
Juan Cao
Peng Qi
Danding Wang
Jintao Li
DiffM
77
14
0
23 Jul 2024
HAPFI: History-Aware Planning based on Fused Information
HAPFI: History-Aware Planning based on Fused Information
Sujin Jeon
Suyeon Shin
Byoung-Tak Zhang
63
0
0
23 Jul 2024
Spatiotemporal Graph Guided Multi-modal Network for Livestreaming
  Product Retrieval
Spatiotemporal Graph Guided Multi-modal Network for Livestreaming Product Retrieval
Xiaowan Hu
Yiyi Chen
Yan Li
Minquan Wang
Haoqian Wang
Quan Chen
Han Li
Peng Jiang
AI4TS
80
0
0
23 Jul 2024
Chameleon: Images Are What You Need For Multimodal Learning Robust To
  Missing Modalities
Chameleon: Images Are What You Need For Multimodal Learning Robust To Missing Modalities
Muhammad Irzam Liaqat
Shah Nawaz
Muhammad Zaigham Zaheer
M. S. Saeed
Hassan Sajjad
Tom De Schepper
Karthik Nandakumar
Muhammad Haris Khan
96
1
0
23 Jul 2024
Knowledge Acquisition Disentanglement for Knowledge-based Visual
  Question Answering with Large Language Models
Knowledge Acquisition Disentanglement for Knowledge-based Visual Question Answering with Large Language Models
Wenbin An
Feng Tian
Jiahao Nie
Wenkai Shi
Haonan Lin
Yan Chen
Qianying Wang
Y. Wu
Guang Dai
Ping Chen
VLM
96
4
0
22 Jul 2024
Exploring the Effectiveness of Object-Centric Representations in Visual Question Answering: Comparative Insights with Foundation Models
Exploring the Effectiveness of Object-Centric Representations in Visual Question Answering: Comparative Insights with Foundation Models
Amir Mohammad Karimi Mamaghan
Samuele Papa
Karl Henrik Johansson
Stefan Bauer
Andrea Dittadi
OCL
176
9
0
22 Jul 2024
Benchmark Granularity and Model Robustness for Image-Text Retrieval
Benchmark Granularity and Model Robustness for Image-Text Retrieval
Mariya Hendriksen
Shuo Zhang
R. Reinanda
Mohamed Yahya
Edgar Meij
Maarten de Rijke
98
0
0
21 Jul 2024
Learning Visual Grounding from Generative Vision and Language Model
Learning Visual Grounding from Generative Vision and Language Model
Shijie Wang
Dahun Kim
A. Taalimi
Chen Sun
Weicheng Kuo
ObjD
113
7
0
18 Jul 2024
Multimodal Label Relevance Ranking via Reinforcement Learning
Multimodal Label Relevance Ranking via Reinforcement Learning
Taian Guo
Taolin Zhang
Haoqian Wu
Hanjun Li
Ruizhi Qiao
Xing Sun
OffRL
50
0
0
18 Jul 2024
Towards Zero-Shot Multimodal Machine Translation
Towards Zero-Shot Multimodal Machine Translation
Matthieu Futeral
Cordelia Schmid
Benoît Sagot
Rachel Bawden
114
4
0
18 Jul 2024
NavGPT-2: Unleashing Navigational Reasoning Capability for Large
  Vision-Language Models
NavGPT-2: Unleashing Navigational Reasoning Capability for Large Vision-Language Models
Gengze Zhou
Yicong Hong
Zun Wang
Xin Eric Wang
Qi Wu
LM&Ro
96
30
0
17 Jul 2024
ModalChorus: Visual Probing and Alignment of Multi-modal Embeddings via
  Modal Fusion Map
ModalChorus: Visual Probing and Alignment of Multi-modal Embeddings via Modal Fusion Map
Yilin Ye
Shishi Xiao
Xingchen Zeng
Wei Zeng
116
5
0
17 Jul 2024
Multimodal Reranking for Knowledge-Intensive Visual Question Answering
Multimodal Reranking for Knowledge-Intensive Visual Question Answering
Haoyang Wen
Honglei Zhuang
Hamed Zamani
Alexander Hauptmann
Michael Bendersky
62
1
0
17 Jul 2024
How and where does CLIP process negation?
How and where does CLIP process negation?
Vincent Quantmeyer
Pablo Mosteiro
Albert Gatt
CoGe
73
9
0
15 Jul 2024
IoT-LM: Large Multisensory Language Models for the Internet of Things
IoT-LM: Large Multisensory Language Models for the Internet of Things
Shentong Mo
Russ Salakhutdinov
Louis-Philippe Morency
Paul Pu Liang
MLLM
76
11
0
13 Jul 2024
Textual Query-Driven Mask Transformer for Domain Generalized
  Segmentation
Textual Query-Driven Mask Transformer for Domain Generalized Segmentation
Byeonghyun Pak
Byeongju Woo
Sunghwan Kim
Dae-Hwan Kim
Hoseong Kim
136
5
0
12 Jul 2024
ElasticAST: An Audio Spectrogram Transformer for All Length and
  Resolutions
ElasticAST: An Audio Spectrogram Transformer for All Length and Resolutions
Jiu Feng
Mehmet Hamza Erol
Joon Son Chung
Arda Senocak
72
1
0
11 Jul 2024
IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language
  Model
IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model
Yatai Ji
Shilong Zhang
Jie Wu
Peize Sun
Weifeng Chen
Xuefeng Xiao
Sidi Yang
Yanting Yang
Ping Luo
VLM
80
4
0
10 Jul 2024
How to Make Cross Encoder a Good Teacher for Efficient Image-Text
  Retrieval?
How to Make Cross Encoder a Good Teacher for Efficient Image-Text Retrieval?
Yuxin Chen
Zongyang Ma
Ziqi Zhang
Zhongang Qi
Chunfeng Yuan
Bing Li
Junfu Pu
Ying Shan
Xiaojuan Qi
Weiming Hu
67
2
0
10 Jul 2024
3D Vision and Language Pretraining with Large-Scale Synthetic Data
3D Vision and Language Pretraining with Large-Scale Synthetic Data
Dejie Yang
Zhu Xu
Wentao Mo
Qingchao Chen
Siyuan Huang
Yang Liu
89
6
0
08 Jul 2024
AI as a Tool for Fair Journalism: Case Studies from Malta
AI as a Tool for Fair Journalism: Case Studies from Malta
Dylan Seychell
Gabriel Hili
Jonathan Attard
Konstantinos Makantatis
38
3
0
08 Jul 2024
LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual
  Contexts
LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts
Yijia Xiao
Edward Sun
Tianyu Liu
Wei Wang
LRM
84
42
0
06 Jul 2024
HEMM: Holistic Evaluation of Multimodal Foundation Models
HEMM: Holistic Evaluation of Multimodal Foundation Models
Paul Pu Liang
Akshay Goindani
Talha Chafekar
Leena Mathur
Haofei Yu
Ruslan Salakhutdinov
Louis-Philippe Morency
98
16
0
03 Jul 2024
Multi-Task Domain Adaptation for Language Grounding with 3D Objects
Multi-Task Domain Adaptation for Language Grounding with 3D Objects
Penglei Sun
Yaoxian Song
Xinglin Pan
Peijie Dong
Xiaofei Yang
Qiang-qiang Wang
Zhixu Li
Tiefeng Li
Xiaowen Chu
127
1
0
03 Jul 2024
Images Speak Louder than Words: Understanding and Mitigating Bias in Vision-Language Model from a Causal Mediation Perspective
Images Speak Louder than Words: Understanding and Mitigating Bias in Vision-Language Model from a Causal Mediation Perspective
Zhaotian Weng
Zijun Gao
Jerone Andrews
Jieyu Zhao
84
1
0
03 Jul 2024
SafaRi:Adaptive Sequence Transformer for Weakly Supervised Referring
  Expression Segmentation
SafaRi:Adaptive Sequence Transformer for Weakly Supervised Referring Expression Segmentation
Sayan Nag
Koustava Goswami
Srikrishna Karanam
107
4
0
02 Jul 2024
MIREncoder: Multi-modal IR-based Pretrained Embeddings for Performance
  Optimizations
MIREncoder: Multi-modal IR-based Pretrained Embeddings for Performance Optimizations
Akash Dutta
Ali Jannesari
81
0
0
02 Jul 2024
CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding
  Evaluation
CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation
Yuxuan Wang
Yijun Liu
Fei Yu
Chen Huang
Kexin Li
Zhiguo Wan
Wanxiang Che
VLMCoGe
81
5
0
01 Jul 2024
Previous
123456...414243
Next