1908.02265
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
6 August 2019
Jiasen Lu
Dhruv Batra
Devi Parikh
Stefan Lee
SSL
VLM
Papers citing
"ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks"
50 / 2,093 papers shown
Title
FakingRecipe: Detecting Fake News on Short Video Platforms from the Perspective of Creative Process
Yuyan Bu
Qiang Sheng
Juan Cao
Peng Qi
Danding Wang
Jintao Li
DiffM
41
8
0
23 Jul 2024
HAPFI: History-Aware Planning based on Fused Information
Sujin Jeon
Suyeon Shin
Byoung-Tak Zhang
39
0
0
23 Jul 2024
Spatiotemporal Graph Guided Multi-modal Network for Livestreaming Product Retrieval
Xiaowan Hu
Yiyi Chen
Yan Li
Minquan Wang
Haoqian Wang
Quan Chen
Han Li
Peng Jiang
AI4TS
37
0
0
23 Jul 2024
Chameleon: Images Are What You Need For Multimodal Learning Robust To Missing Modalities
Muhammad Irzam Liaqat
Shah Nawaz
Muhammad Zaigham Zaheer
M. S. Saeed
Hassan Sajjad
Tom De Schepper
Karthik Nandakumar
Muhammad Haris Khan
30
1
0
23 Jul 2024
Knowledge Acquisition Disentanglement for Knowledge-based Visual Question Answering with Large Language Models
Wenbin An
Feng Tian
Jiahao Nie
Wenkai Shi
Haonan Lin
Yan Chen
Qianying Wang
Y. Wu
Guang Dai
Ping Chen
VLM
53
4
0
22 Jul 2024
Exploring the Effectiveness of Object-Centric Representations in Visual Question Answering: Comparative Insights with Foundation Models
Amir Mohammad Karimi Mamaghan
Samuele Papa
Karl Henrik Johansson
Stefan Bauer
Andrea Dittadi
OCL
48
5
0
22 Jul 2024
Assessing Brittleness of Image-Text Retrieval Benchmarks from Vision-Language Models Perspective
Mariya Hendriksen
Shuo Zhang
R. Reinanda
Mohamed Yahya
Edgar Meij
Maarten de Rijke
59
0
0
21 Jul 2024
Learning Visual Grounding from Generative Vision and Language Model
Shijie Wang
Dahun Kim
A. Taalimi
Chen Sun
Weicheng Kuo
ObjD
36
5
0
18 Jul 2024
Multimodal Label Relevance Ranking via Reinforcement Learning
Taian Guo
Taolin Zhang
Haoqian Wu
Hanjun Li
Ruizhi Qiao
Xing Sun
OffRL
24
0
0
18 Jul 2024
Towards Zero-Shot Multimodal Machine Translation
Matthieu Futeral
Cordelia Schmid
Benoît Sagot
Rachel Bawden
40
3
0
18 Jul 2024
NavGPT-2: Unleashing Navigational Reasoning Capability for Large Vision-Language Models
Gengze Zhou
Yicong Hong
Zun Wang
Xin Eric Wang
Qi Wu
LM&Ro
45
19
0
17 Jul 2024
ModalChorus: Visual Probing and Alignment of Multi-modal Embeddings via Modal Fusion Map
Yilin Ye
Shishi Xiao
Xingchen Zeng
Wei Zeng
46
3
0
17 Jul 2024
Multimodal Reranking for Knowledge-Intensive Visual Question Answering
Haoyang Wen
Honglei Zhuang
Hamed Zamani
Alexander Hauptmann
Michael Bendersky
42
0
0
17 Jul 2024
How and where does CLIP process negation?
Vincent Quantmeyer
Pablo Mosteiro
Albert Gatt
CoGe
29
6
0
15 Jul 2024
IoT-LM: Large Multisensory Language Models for the Internet of Things
Shentong Mo
Russ Salakhutdinov
Louis-Philippe Morency
Paul Pu Liang
MLLM
32
7
0
13 Jul 2024
Textual Query-Driven Mask Transformer for Domain Generalized Segmentation
Byeonghyun Pak
Byeongju Woo
Sunghwan Kim
Dae-Hwan Kim
Hoseong Kim
52
3
0
12 Jul 2024
ElasticAST: An Audio Spectrogram Transformer for All Length and Resolutions
Jiu Feng
Mehmet Hamza Erol
Joon Son Chung
Arda Senocak
33
1
0
11 Jul 2024
IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model
Yatai Ji
Shilong Zhang
Jie Wu
Peize Sun
Weifeng Chen
Xuefeng Xiao
Sidi Yang
Yanting Yang
Ping Luo
VLM
48
3
0
10 Jul 2024
How to Make Cross Encoder a Good Teacher for Efficient Image-Text Retrieval?
Yuxin Chen
Zongyang Ma
Ziqi Zhang
Zhongang Qi
Chunfeng Yuan
Bing Li
Junfu Pu
Ying Shan
Xiaojuan Qi
Weiming Hu
41
2
0
10 Jul 2024
3D Vision and Language Pretraining with Large-Scale Synthetic Data
Dejie Yang
Zhu Xu
Wentao Mo
Qingchao Chen
Siyuan Huang
Yang Liu
24
5
0
08 Jul 2024
AI as a Tool for Fair Journalism: Case Studies from Malta
Dylan Seychell
Gabriel Hili
Jonathan Attard
Konstantinos Makantasis
21
3
0
08 Jul 2024
LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts
Yijia Xiao
Edward Sun
Tianyu Liu
Wei Wang
LRM
35
27
0
06 Jul 2024
HEMM: Holistic Evaluation of Multimodal Foundation Models
Paul Pu Liang
Akshay Goindani
Talha Chafekar
Leena Mathur
Haofei Yu
Ruslan Salakhutdinov
Louis-Philippe Morency
41
10
0
03 Jul 2024
Multi-Task Domain Adaptation for Language Grounding with 3D Objects
Penglei Sun
Yaoxian Song
Xinglin Pan
Peijie Dong
Xiaofei Yang
Qiang-qiang Wang
Zhixu Li
Tiefeng Li
Xiaowen Chu
70
1
0
03 Jul 2024
Images Speak Louder than Words: Understanding and Mitigating Bias in Vision-Language Model from a Causal Mediation Perspective
Zhaotian Weng
Zijun Gao
Jerone Andrews
Jieyu Zhao
33
0
0
03 Jul 2024
SafaRi: Adaptive Sequence Transformer for Weakly Supervised Referring Expression Segmentation
Sayan Nag
Koustava Goswami
Srikrishna Karanam
50
2
0
02 Jul 2024
MIREncoder: Multi-modal IR-based Pretrained Embeddings for Performance Optimizations
Akash Dutta
Ali Jannesari
38
0
0
02 Jul 2024
CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation
Yuxuan Wang
Yijun Liu
Fei Yu
Chen Huang
Kexin Li
Zhiguo Wan
Wanxiang Che
VLM
CoGe
35
5
0
01 Jul 2024
The Odyssey of Commonsense Causality: From Foundational Benchmarks to Cutting-Edge Reasoning
Shaobo Cui
Zhijing Jin
Bernhard Schölkopf
Boi Faltings
CML
LRM
47
4
0
27 Jun 2024
Enhancing Continual Learning in Visual Question Answering with Modality-Aware Feature Distillation
Malvina Nikandrou
Georgios Pantazopoulos
Ioannis Konstas
Alessandro Suglia
32
0
0
27 Jun 2024
Human-Aware Vision-and-Language Navigation: Bridging Simulation to Reality with Dynamic Human Interactions
Minghan Li
Heng Li
Zhi-Qi Cheng
Yifei Dong
Yuxuan Zhou
Jun-Yan He
Qi Dai
Teruko Mitamura
Alexander G. Hauptmann
LM&Ro
43
4
0
27 Jun 2024
Improving the Consistency in Cross-Lingual Cross-Modal Retrieval with 1-to-K Contrastive Learning
Zhijie Nie
Richong Zhang
Zhangchi Feng
Hailang Huang
Xudong Liu
40
1
0
26 Jun 2024
ScanFormer: Referring Expression Comprehension by Iteratively Scanning
Wei Su
Peihan Miao
Huanzhang Dou
Xi Li
ObjD
50
7
0
26 Jun 2024
Towards a Science Exocortex
Kevin G. Yager
80
0
0
24 Jun 2024
What Do VLMs NOTICE? A Mechanistic Interpretability Pipeline for Gaussian-Noise-free Text-Image Corruption and Evaluation
Michal Golovanevsky
William Rudman
Vedant Palit
Ritambhara Singh
Carsten Eickhoff
33
1
0
24 Jun 2024
Multi-Scale Temporal Difference Transformer for Video-Text Retrieval
Ni Wang
Dongliang Liao
Xing Xu
38
0
0
23 Jun 2024
Towards Natural Language-Driven Assembly Using Foundation Models
O. Joglekar
Tal Lancewicki
Shir Kozlovsky
Vladimir Tchuiev
Zohar Feldman
Dotan Di Castro
LM&Ro
39
0
0
23 Jun 2024
Multi-Granularity and Multi-modal Feature Interaction Approach for Text Video Retrieval
Wenjun Li
Shudong Wang
Dong Zhao
Shenghui Xu
Zhaoming Pan
Zhimin Zhang
34
0
0
21 Jun 2024
Composing Object Relations and Attributes for Image-Text Matching
Khoi Pham
Chuong Huynh
Ser-Nam Lim
Abhinav Shrivastava
CoGe
44
4
0
17 Jun 2024
They're All Doctors: Synthesizing Diverse Counterfactuals to Mitigate Associative Bias
Salma Abdel Magid
Jui-Hsien Wang
Kushal Kafle
Hanspeter Pfister
44
1
0
17 Jun 2024
Open-Vocabulary X-ray Prohibited Item Detection via Fine-tuning CLIP
Shuyang Lin
Tong Jia
Hao Wang
Bowen Ma
Mingyuan Li
Dongyue Chen
VLM
ObjD
41
0
0
16 Jun 2024
MDeRainNet: An Efficient Neural Network for Rain Streak Removal from Macro-pixel Images
Tao Yan
Weijiang He
Chenglong Wang
Xiangjie Zhu
Yinghui Wang
Rynson W. H. Lau
41
0
0
15 Jun 2024
Multimodal Large Language Models with Fusion Low Rank Adaptation for Device Directed Speech Detection
Shruti Palaskar
Oggi Rudovic
Sameer Dharur
Florian Pesce
G. Krishna
Aswin Sivaraman
Jack Berkowitz
Ahmed Hussen Abdelaziz
Saurabh N. Adya
Ahmed H. Tewfik
VLM
60
0
0
13 Jun 2024
Aligning Vision Models with Human Aesthetics in Retrieval: Benchmarks and Algorithms
Miaosen Zhang
Yixuan Wei
Zhen Xing
Yifei Ma
Zuxuan Wu
...
Zheng-Wei Zhang
Qi Dai
Chong Luo
Xin Geng
Baining Guo
VLM
51
1
0
13 Jun 2024
Conceptual Learning via Embedding Approximations for Reinforcing Interpretability and Transparency
Maor Dikter
Tsachi Blau
Chaim Baskin
43
0
0
13 Jun 2024
Labeling Comic Mischief Content in Online Videos with a Multimodal Hierarchical-Cross-Attention Model
Elaheh Baharlouei
Mahsa Shafaei
Yigeng Zhang
Hugo Jair Escalante
Thamar Solorio
51
0
0
12 Jun 2024
Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning
Chenyu Yang
Xizhou Zhu
Jinguo Zhu
Weijie Su
Junjie Wang
...
Lewei Lu
Bin Li
Jie Zhou
Yu Qiao
Jifeng Dai
VLM
CLIP
47
5
0
11 Jun 2024
Integrating Text and Image Pre-training for Multi-modal Algorithmic Reasoning
Zijian Zhang
Wei Liu
37
0
0
08 Jun 2024
One Perturbation is Enough: On Generating Universal Adversarial Perturbations against Vision-Language Pre-training Models
Hao Fang
Jiawei Kong
Wenbo Yu
Bin Chen
Jiawei Li
Hao Wu
Ke Xu
Ke Xu
AAML
VLM
40
13
0
08 Jun 2024
ArMeme: Propagandistic Content in Arabic Memes
Firoj Alam
A. Hasnat
Fatema Ahmed
Md. Arid Hasan
Maram Hasanain
56
7
0
06 Jun 2024