Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
1908.02265
Cited By
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
6 August 2019
Jiasen Lu
Dhruv Batra
Devi Parikh
Stefan Lee
SSL
VLM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks"
50 / 2,090 papers shown
Title
Domain Adaptation of VLM for Soccer Video Understanding
Tiancheng Jiang
Henry Wang
Md Sirajus Salekin
Parmida Atighehchian
Shinan Zhang
VLM
7
0
0
20 May 2025
Enhanced Multimodal Hate Video Detection via Channel-wise and Modality-wise Fusion
Yinghui Zhang
Tailin Chen
Yuchen Zhang
Zeyu Fu
9
0
0
17 May 2025
Prioritizing Image-Related Tokens Enhances Vision-Language Pre-Training
Yiran Chen
Hao Peng
Tong Zhang
Heng Ji
VLM
28
0
0
13 May 2025
Probabilistic Embeddings for Frozen Vision-Language Models: Uncertainty Quantification with Gaussian Process Latent Variable Models
Aishwarya Venkataramanan
P. Bodesheim
Joachim Denzler
BDL
VLM
64
0
0
08 May 2025
OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning
Xianhang Li
Y. Liu
Haoqin Tu
Hongru Zhu
Cihang Xie
VLM
180
0
0
07 May 2025
Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities
Xuzhi Zhang
Jintao Guo
Shanshan Zhao
Minghao Fu
Lunhao Duan
Guo-Hua Wang
Qing-Guo Chen
Zhao Xu
Weihua Luo
Kaifu Zhang
DiffM
74
0
0
05 May 2025
Compositional Image-Text Matching and Retrieval by Grounding Entities
Madhukar Reddy Vongala
Saurabh Srivastava
Jana Kosecka
CLIP
CoGe
VLM
36
0
0
04 May 2025
A Survey of Robotic Navigation and Manipulation with Physics Simulators in the Era of Embodied AI
Lik Hang Kenny Wong
Xueyang Kang
Kaixin Bai
Jianwei Zhang
59
0
0
01 May 2025
Detecting and Mitigating Hateful Content in Multimodal Memes with Vision-Language Models
Minh-Hao Van
Xintao Wu
VLM
88
0
0
30 Apr 2025
DOPE: Dual Object Perception-Enhancement Network for Vision-and-Language Navigation
Yinfeng Yu
Dongsheng Yang
24
0
0
30 Apr 2025
Synergy-CLIP: Extending CLIP with Multi-modal Integration for Robust Representation Learning
Sangyeon Cho
Jangyeong Jeon
Mingi Kim
Junyeong Kim
CLIP
VLM
83
0
0
30 Apr 2025
Multimodal Large Language Models for Medicine: A Comprehensive Survey
Jiarui Ye
Hao Tang
LM&MA
91
0
0
29 Apr 2025
Enhancing Surgical Documentation through Multimodal Visual-Temporal Transformers and Generative AI
Hugo Georgenthum
Cristian Cosentino
Fabrizio Marozzo
Pietro Liò
MedIm
180
0
0
28 Apr 2025
ShapeSpeak: Body Shape-Aware Textual Alignment for Visible-Infrared Person Re-Identification
Shuanglin Yan
Neng Dong
Shuang Li
Rui Yan
Hao Tang
Jing Qin
169
0
0
25 Apr 2025
A Genealogy of Multi-Sensor Foundation Models in Remote Sensing
Kevin Lane
Morteza Karimzadeh
41
0
0
24 Apr 2025
Detecting and Understanding Hateful Contents in Memes Through Captioning and Visual Question-Answering
Ali Anaissi
Junaid Akram
Kunal Chaturvedi
Ali Braytee
33
0
0
23 Apr 2025
VLM as Policy: Common-Law Content Moderation Framework for Short Video Platform
Xingyu Lu
Tianke Zhang
Chang Meng
Xinyu Wang
Jinpeng Wang
...
Hai-Tao Zheng
Fan Yang
Tingting Gao
Di Zhang
Kun Gai
OffRL
54
0
0
21 Apr 2025
EmoSEM: Segment and Explain Emotion Stimuli in Visual Art
Jing Zhang
Dan Guo
Zhangbin Li
Meng Wang
36
0
0
20 Apr 2025
Hadamard product in deep learning: Introduction, Advances and Challenges
Grigorios G. Chrysos
Yongtao Wu
Razvan Pascanu
Philip Torr
V. Cevher
AAML
98
0
0
17 Apr 2025
DeepMLF: Multimodal language model with learnable tokens for deep fusion in sentiment analysis
Efthymios Georgiou
V. Katsouros
Yannis Avrithis
Alexandros Potamianos
24
1
0
15 Apr 2025
FocalLens: Instruction Tuning Enables Zero-Shot Conditional Image Representations
Cheng-Yu Hsieh
Pavan Kumar Anasosalu Vasu
Fartash Faghri
Raviteja Vemulapalli
Chun-Liang Li
Ranjay Krishna
Oncel Tuzel
Hadi Pouransari
VLM
192
0
0
11 Apr 2025
TokenFocus-VQA: Enhancing Text-to-Image Alignment with Position-Aware Focus and Multi-Perspective Aggregations on LVLMs
Zijian Zhang
Xuhui Zheng
X. Wu
Chong Peng
Xuezhi Cao
37
0
0
10 Apr 2025
Zeus: Zero-shot LLM Instruction for Union Segmentation in Multimodal Medical Imaging
Siyuan Dai
Kai Ye
Guodong Liu
Haoteng Tang
Liang Zhan
MedIm
38
0
0
09 Apr 2025
Neutralizing the Narrative: AI-Powered Debiasing of Online News Articles
Chen Wei Kuo
Kevin Chu
Nouar Aldahoul
Hazem Ibrahim
Talal Rahwan
Yasir Zaki
SyDa
63
0
0
04 Apr 2025
Locations of Characters in Narratives: Andersen and Persuasion Datasets
Batuhan Ozyurt
Roya Arkhmammadova
Deniz Yuret
39
0
0
04 Apr 2025
Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision
Xiaofeng Han
Shunpeng Chen
Zenghuang Fu
Zhe Feng
Lue Fan
...
Li Guo
Weiliang Meng
Xiaopeng Zhang
Rongtao Xu
Shibiao Xu
74
1
0
03 Apr 2025
Group-based Distinctive Image Captioning with Memory Difference Encoding and Attention
Jiuniu Wang
Wenjia Xu
Qingzhong Wang
Antoni B. Chan
45
0
0
03 Apr 2025
ANNEXE: Unified Analyzing, Answering, and Pixel Grounding for Egocentric Interaction
Yuejiao Su
Yi Wang
Qiongyang Hu
Chuang Yang
Lap-Pui Chau
47
0
0
02 Apr 2025
RefChartQA: Grounding Visual Answer on Chart Images through Instruction Tuning
Alexander Vogel
Omar Moured
Yufan Chen
Jiaming Zhang
Rainer Stiefelhagen
37
0
0
29 Mar 2025
CTRL-O: Language-Controllable Object-Centric Visual Representation Learning
Aniket Didolkar
Andrii Zadaianchuk
Rabiul Awal
Maximilian Seitzer
E. Gavves
Aishwarya Agrawal
OCL
VLM
89
2
0
27 Mar 2025
VGAT: A Cancer Survival Analysis Framework Transitioning from Generative Visual Question Answering to Genomic Reconstruction
Zizhi Chen
Minghao Han
Xukun Zhang
Shuwei Ma
Tao Liu
Xing Wei
Li Zhang
49
0
0
25 Mar 2025
VisualQuest: A Diverse Image Dataset for Evaluating Visual Recognition in LLMs
Kelaiti Xiao
Liang Yang
Paerhati Tulajiang
Hongfei Lin
MLLM
77
0
0
25 Mar 2025
Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation
Ziming Wei
Bingqian Lin
Yunshuang Nie
Jiaqi Chen
Shikui Ma
Hang Xu
Xiaodan Liang
56
0
0
23 Mar 2025
A Language Anchor-Guided Method for Robust Noisy Domain Generalization
Zilin Dai
Lehong Wang
Fangzhou Lin
Yidong Wang
Zhigang Li
Kazunori D Yamada
Ziming Zhang
Wang Lu
158
0
0
21 Mar 2025
A Survey on fMRI-based Brain Decoding for Reconstructing Multimodal Stimuli
Pengyu Liu
Guohua Dong
D. Guo
Kun Li
Fengling Li
Xun Yang
Meng Wang
Xiaomin Ying
AI4CE
41
0
0
20 Mar 2025
HA-VLN: A Benchmark for Human-Aware Navigation in Discrete-Continuous Environments with Dynamic Multi-Human Interactions, Real-World Validation, and an Open Leaderboard
Yifei Dong
Fengyi Wu
Qi He
Heng Li
Minghan Li
...
Yuxuan Zhou
Jingdong Sun
Qi Dai
Zhi-Qi Cheng
Alexander G. Hauptmann
LM&Ro
50
0
0
18 Mar 2025
Image Captioning Evaluation in the Age of Multimodal LLMs: Challenges and Future Perspectives
Sara Sarto
Marcella Cornia
Rita Cucchiara
48
0
0
18 Mar 2025
Quantum EigenGame for excited state calculation
David Quiroga
Jason Han
Anastasios Kyrillidis
53
0
0
17 Mar 2025
DPC: Dual-Prompt Collaboration for Tuning Vision-Language Models
Haoyang Li
Liang Wang
Chao Wang
Jing Jiang
Yan Peng
Guodong Long
VLM
74
1
0
17 Mar 2025
Learning Privacy from Visual Entities
Alessio Xompero
Andrea Cavallaro
SSL
GNN
66
0
0
16 Mar 2025
DynRsl-VLM: Enhancing Autonomous Driving Perception with Dynamic Resolution Vision-Language Models
Xirui Zhou
Lianlei Shan
Xiaolin Gui
66
0
0
14 Mar 2025
Can LLMs Understand Time Series Anomalies?
Zihao Zhou
Rose Yu
AI4TS
87
8
0
13 Mar 2025
FlowTok: Flowing Seamlessly Across Text and Image Tokens
Ju He
Qihang Yu
Qihao Liu
Liang-Chieh Chen
71
0
0
13 Mar 2025
Towards Understanding Graphical Perception in Large Multimodal Models
Kai Zhang
Jianwei Yang
J. Inala
Chandan Singh
Jianfeng Gao
Yu Su
Chenglong Wang
53
1
0
13 Mar 2025
Seeing and Reasoning with Confidence: Supercharging Multimodal LLMs with an Uncertainty-Aware Agentic Framework
Zhuo Zhi
Chen Feng
Adam Daneshmend
Mine Orlu
Andreas Demosthenous
L. Yin
Da Li
Ziquan Liu
Miguel R. D. Rodrigues
LRM
72
1
0
11 Mar 2025
Anatomy-Aware Conditional Image-Text Retrieval
Meng Zheng
Jiajin Zhang
Benjamin Planche
Zhongpai Gao
Terrence Chen
Ziyan Wu
MedIm
57
0
0
10 Mar 2025
Federated Multimodal Learning with Dual Adapters and Selective Pruning for Communication and Computational Efficiency
Duy Phuong Nguyen
J. P. Muñoz
Tanya Roosta
Ali Jannesari
FedML
67
0
0
10 Mar 2025
TI-JEPA: An Innovative Energy-based Joint Embedding Strategy for Text-Image Multimodal Systems
Khang H. N. Vo
D. Q. Nguyen
T. Nguyen
Tho Quan
50
0
0
09 Mar 2025
Narrating the Video: Boosting Text-Video Retrieval via Comprehensive Utilization of Frame-Level Captions
Chan hur
Jeong-hun Hong
Dong-hun Lee
Dabin Kang
Semin Myeong
Sang-hyo Park
Hyeyoung Park
60
0
0
07 Mar 2025
RCRank: Multimodal Ranking of Root Causes of Slow Queries in Cloud Database Systems
Biao Ouyang
Yingying Zhang
Hanyin Cheng
Yang Shu
Chenjuan Guo
Bin Yang
Qingsong Wen
L. Fan
Christian S. Jensen
56
1
0
06 Mar 2025
1
2
3
4
...
40
41
42
Next