ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
6 August 2019
Jiasen Lu
Dhruv Batra
Devi Parikh
Stefan Lee
SSL
VLM
Papers citing "ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks"
50 / 2,118 papers shown
Title
The Odyssey of Commonsense Causality: From Foundational Benchmarks to Cutting-Edge Reasoning
Shaobo Cui
Zhijing Jin
Bernhard Schölkopf
Boi Faltings
CML
LRM
93
4
0
27 Jun 2024
Enhancing Continual Learning in Visual Question Answering with Modality-Aware Feature Distillation
Malvina Nikandrou
Georgios Pantazopoulos
Ioannis Konstas
Alessandro Suglia
79
2
0
27 Jun 2024
Human-Aware Vision-and-Language Navigation: Bridging Simulation to Reality with Dynamic Human Interactions
Minghan Li
Heng Li
Zhi-Qi Cheng
Yifei Dong
Yuxuan Zhou
Jun-Yan He
Qi Dai
Teruko Mitamura
Alexander G. Hauptmann
LM&Ro
92
6
0
27 Jun 2024
Improving the Consistency in Cross-Lingual Cross-Modal Retrieval with 1-to-K Contrastive Learning
Zhijie Nie
Richong Zhang
Zhangchi Feng
Hailang Huang
Xudong Liu
100
3
0
26 Jun 2024
ScanFormer: Referring Expression Comprehension by Iteratively Scanning
Wei Su
Peihan Miao
Huanzhang Dou
Xi Li
ObjD
105
9
0
26 Jun 2024
A Survey on Mixture of Experts in Large Language Models
Weilin Cai
Juyong Jiang
Fan Wang
Jing Tang
Sunghun Kim
Jiayi Huang
MoE
98
123
0
26 Jun 2024
Towards a Science Exocortex
Kevin G. Yager
116
2
0
24 Jun 2024
What Do VLMs NOTICE? A Mechanistic Interpretability Pipeline for Gaussian-Noise-free Text-Image Corruption and Evaluation
Michal Golovanevsky
William Rudman
Vedant Palit
Ritambhara Singh
Carsten Eickhoff
130
3
0
24 Jun 2024
Multi-Scale Temporal Difference Transformer for Video-Text Retrieval
Ni Wang
Dongliang Liao
Xing Xu
75
1
0
23 Jun 2024
Towards Natural Language-Driven Assembly Using Foundation Models
O. Joglekar
Tal Lancewicki
Shir Kozlovsky
Vladimir Tchuiev
Zohar Feldman
Dotan Di Castro
LM&Ro
80
0
0
23 Jun 2024
Multi-Granularity and Multi-modal Feature Interaction Approach for Text Video Retrieval
Wenjun Li
Shudong Wang
Dong Zhao
Shenghui Xu
Zhaoming Pan
Zhimin Zhang
59
0
0
21 Jun 2024
Composing Object Relations and Attributes for Image-Text Matching
Khoi Pham
Chuong Huynh
Ser-Nam Lim
Abhinav Shrivastava
CoGe
79
8
0
17 Jun 2024
They're All Doctors: Synthesizing Diverse Counterfactuals to Mitigate Associative Bias
Salma Abdel Magid
Jui-Hsien Wang
Kushal Kafle
Hanspeter Pfister
128
1
0
17 Jun 2024
Open-Vocabulary X-ray Prohibited Item Detection via Fine-tuning CLIP
Shuyang Lin
Tong Jia
Hao Wang
Bowen Ma
Mingyuan Li
Dongyue Chen
VLM
ObjD
84
0
0
16 Jun 2024
MDeRainNet: An Efficient Macro-pixel Image Rain Removal Network
Tao Yan
Weijiang He
Chenglong Wang
Cihang Wei
Xiangjie Zhu
Yinghui Wang
Rynson W. H. Lau
93
0
0
15 Jun 2024
Multimodal Large Language Models with Fusion Low Rank Adaptation for Device Directed Speech Detection
Shruti Palaskar
Oggi Rudovic
Sameer Dharur
Florian Pesce
G. Krishna
Aswin Sivaraman
Jack Berkowitz
Ahmed Hussen Abdelaziz
Saurabh N. Adya
Ahmed H. Tewfik
VLM
88
0
0
13 Jun 2024
Aligning Vision Models with Human Aesthetics in Retrieval: Benchmarks and Algorithms
Miaosen Zhang
Yixuan Wei
Zhen Xing
Yifei Ma
Zuxuan Wu
...
Zheng Zhang
Qi Dai
Chong Luo
Xin Geng
Baining Guo
VLM
86
1
0
13 Jun 2024
Conceptual Learning via Embedding Approximations for Reinforcing Interpretability and Transparency
Maor Dikter
Tsachi Blau
Chaim Baskin
128
0
0
13 Jun 2024
Labeling Comic Mischief Content in Online Videos with a Multimodal Hierarchical-Cross-Attention Model
Elaheh Baharlouei
Mahsa Shafaei
Yigeng Zhang
Hugo Jair Escalante
Thamar Solorio
88
0
0
12 Jun 2024
Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning
Chenyu Yang
Xizhou Zhu
Jinguo Zhu
Weijie Su
Junjie Wang
...
Lewei Lu
Bin Li
Jie Zhou
Yu Qiao
Jifeng Dai
VLM
CLIP
87
6
0
11 Jun 2024
Integrating Text and Image Pre-training for Multi-modal Algorithmic Reasoning
Zijian Zhang
Wei Liu
100
0
0
08 Jun 2024
One Perturbation is Enough: On Generating Universal Adversarial Perturbations against Vision-Language Pre-training Models
Hao Fang
Jiawei Kong
Wenbo Yu
Bin Chen
Jiawei Li
Hao Wu
Ke Xu
Ke Xu
AAML
VLM
133
14
0
08 Jun 2024
ArMeme: Propagandistic Content in Arabic Memes
Firoj Alam
A. Hasnat
Fatema Ahmed
Md. Arid Hasan
Maram Hasanain
78
8
0
06 Jun 2024
Attribute-Aware Implicit Modality Alignment for Text Attribute Person Search
Xin Wang
Fangfang Liu
Zheng Li
Caili Guo
107
1
0
06 Jun 2024
MuJo: Multimodal Joint Feature Space Learning for Human Activity Recognition
Stefan Gerd Fritsch
Cennet Oğuz
Vitor Fortes Rey
L. Ray
Maximilian Kiefer-Emmanouilidis
Paul Lukowicz
HAI
116
0
0
06 Jun 2024
FILS: Self-Supervised Video Feature Prediction In Semantic Language Space
Mona Ahmadian
Frank Guerin
Andrew Gilbert
119
1
0
05 Jun 2024
Translation Deserves Better: Analyzing Translation Artifacts in Cross-lingual Visual Question Answering
Yujin Baek
Koanho Lee
Hyesu Lim
Jaeseok Kim
Junmo Park
Yu-Jung Heo
Du-Seong Chang
Jaegul Choo
42
3
0
04 Jun 2024
Multimodal Reasoning with Multimodal Knowledge Graph
Junlin Lee
Yequan Wang
Jing Li
Min Zhang
98
23
0
04 Jun 2024
Progressive Confident Masking Attention Network for Audio-Visual Segmentation
Yuxuan Wang
Feng Dong
Jinchao Zhu
Shuyue Zhu
VOS
163
0
0
04 Jun 2024
Augmented Commonsense Knowledge for Remote Object Grounding
Bahram Mohammadi
Yicong Hong
Yuankai Qi
Qi Wu
Shirui Pan
Javen Qinfeng Shi
100
8
0
03 Jun 2024
GeminiFusion: Efficient Pixel-wise Multimodal Fusion for Vision Transformer
Ding Jia
Jianyuan Guo
Kai Han
Han Wu
Chao Zhang
Chang Xu
Xinghao Chen
ViT
168
24
0
03 Jun 2024
Towards Rationality in Language and Multimodal Agents: A Survey
Bowen Jiang
Yangxinyu Xie
Xiaomeng Wang
Yuan Yuan
Weijie J. Su
Camillo J. Taylor
Tanwi Mallick
LLMAG
89
6
0
01 Jun 2024
Hard Cases Detection in Motion Prediction by Vision-Language Foundation Models
Yi Yang
Qingwen Zhang
Kei Ikemura
Nazre Batool
John Folkesson
VLM
77
2
0
31 May 2024
Retrieval Meets Reasoning: Even High-school Textbook Knowledge Benefits Multimodal Reasoning
Cheng Tan
Jingxuan Wei
Linzhuang Sun
Zhangyang Gao
Siyuan Li
Bihui Yu
Ruifeng Guo
Stan Z. Li
ReLM
LRM
3DV
112
7
0
31 May 2024
Can't make an Omelette without Breaking some Eggs: Plausible Action Anticipation using Large Video-Language Models
Himangi Mittal
Nakul Agarwal
Shao-Yuan Lo
Kwonjoon Lee
121
18
0
30 May 2024
ContextBLIP: Doubly Contextual Alignment for Contrastive Image Retrieval from Linguistically Complex Descriptions
Honglin Lin
Siyu Li
Gu Nan
Chaoyue Tang
Xueting Wang
...
Yankai Rong
Zhili Zhou
Yutong Gao
Qimei Cui
Xiaofeng Tao
52
0
0
29 May 2024
Transcending Fusion: A Multi-Scale Alignment Method for Remote Sensing Image-Text Retrieval
Rui Yang
Shuang Wang
Yi Han
Yuanheng Li
Dong Zhao
Dou Quan
Yanhe Guo
Licheng Jiao
92
4
0
29 May 2024
Do Vision-Language Transformers Exhibit Visual Commonsense? An Empirical Study of VCR
Zhenyang Li
Yangyang Guo
Ke-Jyun Wang
Xiaolin Chen
Liqiang Nie
Mohan S. Kankanhalli
LRM
54
8
0
27 May 2024
Lateralization MLP: A Simple Brain-inspired Architecture for Diffusion
Zizhao Hu
Mohammad Rostami
81
0
0
25 May 2024
Planted: a dataset for planted forest identification from multi-satellite time series
L. M. Pazos-Outón
Cristina Nader Vasconcelos
Anton Raichuk
Anurag Arnab
Dan Morris
Maxim Neumann
85
5
0
24 May 2024
Text-guided 3D Human Motion Generation with Keyframe-based Parallel Skip Transformer
Zichen Geng
Caren Han
Zeeshan Hayder
Jian Liu
Mubarak Shah
Ajmal Mian
68
4
0
24 May 2024
What Do You See? Enhancing Zero-Shot Image Classification with Multimodal Large Language Models
Abdelrahman Abdelhamed
Mahmoud Afifi
Alec Go
MLLM
VLM
165
3
0
24 May 2024
Distilling Vision-Language Pretraining for Efficient Cross-Modal Retrieval
Young Kyun Jang
Donghyun Kim
Ser-nam Lim
VLM
64
0
0
23 May 2024
Boosting Medical Image-based Cancer Detection via Text-guided Supervision from Reports
Guangyu Guo
Jiawen Yao
Yingda Xia
Tony C. W. Mok
Zhilin Zheng
Junwei Han
Le Lu
Dingwen Zhang
Jian Zhou
Ling Zhang
68
1
0
23 May 2024
A Survey on Vision-Language-Action Models for Embodied AI
Yueen Ma
Zixing Song
Yuzheng Zhuang
Jianye Hao
Irwin King
LM&Ro
337
54
0
23 May 2024
From CNNs to Transformers in Multimodal Human Action Recognition: A Survey
Muhammad Bilal Shaikh
Syed Mohammed Shamsul Islam
Douglas Chai
Naveed Akhtar
108
10
0
22 May 2024
Comprehensive Multimodal Deep Learning Survival Prediction Enabled by a Transformer Architecture: A Multicenter Study in Glioblastoma
A. Gomaa
Yixing Huang
Amr Hagag
Charlotte Schmitter
Daniel Höfler
...
U. Gaipl
S. Semrau
Christoph Bert
R. Fietkau
F. Putz
66
9
0
21 May 2024
A Novel Fusion Architecture for PD Detection Using Semi-Supervised Speech Embeddings
Tariq Adnan
Abdelrahman Abdelkader
Zipei Liu
Ekram Hossain
Sooyong Park
Md. Saiful Islam
Ehsan Hoque
58
2
0
21 May 2024
ColorFoil: Investigating Color Blindness in Large Vision and Language Models
Ahnaf Mozib Samin
M. F. Ahmed
Md. Mushtaq Shahriyar Rafee
VLM
119
3
0
19 May 2024
MemeMQA: Multimodal Question Answering for Memes via Rationale-Based Inferencing
Siddhant Agarwal
Shivam Sharma
Preslav Nakov
Tanmoy Chakraborty
94
4
0
18 May 2024