Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
1908.02265
Cited By
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
6 August 2019
Jiasen Lu
Dhruv Batra
Devi Parikh
Stefan Lee
SSL
VLM
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks"
50 / 2,118 papers shown
Title
LaVi: Efficient Large Vision-Language Models via Internal Feature Modulation
Tongtian Yue
Longteng Guo
Yepeng Tang
Zijia Zhao
Xinxin Zhu
Hua Huang
Jing Liu
MLLM
VLM
21
0
0
20 Jun 2025
Stepping Out of Similar Semantic Space for Open-Vocabulary Segmentation
Yong-Jin Liu
SongLi Wu
Sule Bai
Jiahao Wang
Yitong Wang
Yansong Tang
VLM
VOS
61
0
0
19 Jun 2025
Understanding GUI Agent Localization Biases through Logit Sharpness
Xingjian Tao
Yiwei Wang
Yujun Cai
Zhicheng YANG
Jing Tang
LLMAG
19
0
0
18 Jun 2025
HierVL: Semi-Supervised Segmentation leveraging Hierarchical Vision-Language Synergy with Dynamic Text-Spatial Query Alignment
Numair Nadeem
Saeed Anwar
Muhammad Asad
Abdul Bais
VLM
31
0
0
16 Jun 2025
Dynamic Modality Scheduling for Multimodal Large Models via Confidence, Uncertainty, and Semantic Consistency
Hiroshi Tanaka
Anika Rao
Hana Satou
Michael Johnson
Sofia García
25
0
0
15 Jun 2025
Generative or Discriminative? Revisiting Text Classification in the Era of Transformers
Siva Rajesh Kasa
Karan Gupta
Sumegh Roychowdhury
Ashutosh Kumar
Yaswanth Biruduraju
Santhosh Kumar Kasa
Nikhil Pattisapu
Arindam Bhattacharya
Shailendra Agarwal
Vijay huddar
27
0
0
13 Jun 2025
Manager: Aggregating Insights from Unimodal Experts in Two-Tower VLMs and MLLMs
Xiao Xu
L. Qin
Wanxiang Che
Min-Yen Kan
MoE
VLM
38
0
0
13 Jun 2025
Intention-Conditioned Flow Occupancy Models
Chongyi Zheng
S. Park
Sergey Levine
Benjamin Eysenbach
AI4TS
OffRL
AI4CE
48
0
0
10 Jun 2025
MrM: Black-Box Membership Inference Attacks against Multimodal RAG Systems
Peiru Yang
Jinhua Yin
Haoran Zheng
Xueying Bai
Huili Wang
Yufei Sun
Xintian Li
Shangguang Wang
Yongfeng Huang
Tao Qi
AAML
23
0
0
09 Jun 2025
Representation Decomposition for Learning Similarity and Contrastness Across Modalities for Affective Computing
Yuanhe Tian
Pengsen Cheng
Guoqing Jin
Lei Zhang
Yan Song
33
1
0
08 Jun 2025
OpenFace 3.0: A Lightweight Multitask System for Comprehensive Facial Behavior Analysis
Jiewen Hu
Leena Mathur
Paul Pu Liang
Louis-Philippe Morency
CVBM
59
0
0
03 Jun 2025
MINT: Multimodal Instruction Tuning with Multimodal Interaction Grouping
Xiaojun Shan
Qi Cao
Xing Han
Haofei Yu
Paul Liang
57
0
0
02 Jun 2025
Flexible Tool Selection through Low-dimensional Attribute Alignment of Vision and Language
Guangfu Hao
Haojie Wen
Liangxuna Guo
Yang Chen
Yanchao Bi
S. Yu
64
0
0
28 May 2025
GETReason: Enhancing Image Context Extraction through Hierarchical Multi-Agent Reasoning
Shikhhar Siingh
Abhinav Rawat
Chitta Baral
Vivek Gupta
54
0
0
28 May 2025
E2E Process Automation Leveraging Generative AI and IDP-Based Automation Agent: A Case Study on Corporate Expense Processing
Cheonsu Jeong
Seongmin Sim
Hyoyoung Cho
Sungsu Kim
Byounggwan Shin
46
1
0
27 May 2025
Representation Discrepancy Bridging Method for Remote Sensing Image-Text Retrieval
Hailong Ning
Siying Wang
Tao Lei
Xiaopeng Cao
Huanmin Dou
Bin Zhao
Asoke K. Nandi
Petia Radeva
43
0
0
22 May 2025
R1-ShareVL: Incentivizing Reasoning Capability of Multimodal Large Language Models via Share-GRPO
Huanjin Yao
Qixiang Yin
Jingyi Zhang
Min Yang
Yibo Wang
...
Fei Su
Li Shen
Minghui Qiu
Dacheng Tao
Jiaxing Huang
LRM
72
0
0
22 May 2025
Large Language models for Time Series Analysis: Techniques, Applications, and Challenges
Feifei Shi
Xueyan Yin
Kang Wang
Wanyu Tu
Qifu Sun
Huansheng Ning
AI4TS
26
0
0
21 May 2025
InstanceBEV: Unifying Instance and BEV Representation for Global Modeling
Feng Li
Kun Xu
Zhaoyue Wang
Yunduan Cui
Mohammad Masum Billah
Jia Liu
72
0
0
20 May 2025
ReactDiff: Latent Diffusion for Facial Reaction Generation
Jiaming Li
Sheng Wang
Xin Wang
Yitao Zhu
Honglin Xiong
Zixu Zhuang
Qian Wang
DiffM
VGen
76
0
0
20 May 2025
Domain Adaptation of VLM for Soccer Video Understanding
Tiancheng Jiang
Henry Wang
Md Sirajus Salekin
Parmida Atighehchian
Shinan Zhang
VLM
106
0
0
20 May 2025
Multi-modal contrastive learning adapts to intrinsic dimensions of shared latent variables
Yu Gui
Cong Ma
Zongming Ma
SSL
103
0
0
18 May 2025
Mitigating Hallucinations via Inter-Layer Consistency Aggregation in Large Vision-Language Models
Kai Tang
Jinhao You
Xiuqi Ge
Hanze Li
Yichen Guo
Xiande Huang
MLLM
175
0
0
18 May 2025
Enhanced Multimodal Hate Video Detection via Channel-wise and Modality-wise Fusion
Yinghui Zhang
Tailin Chen
Yuchen Zhang
Zeyu Fu
89
0
0
17 May 2025
DPSeg: Dual-Prompt Cost Volume Learning for Open-Vocabulary Semantic Segmentation
Ziyu Zhao
Xiaoguang Li
Linjia Shi
Nasrin Imanpour
Song Wang
VLM
77
0
0
16 May 2025
GeoMM: On Geodesic Perspective for Multi-modal Learning
Shibin Mei
Hang Wang
Bingbing Ni
82
0
0
16 May 2025
Open Set Domain Adaptation with Vision-language models via Gradient-aware Separation
Haoyang Chen
VLM
80
0
0
16 May 2025
On the Interplay of Human-AI Alignment,Fairness, and Performance Trade-offs in Medical Imaging
Haozhe Luo
Ziyu Zhou
Zixin Shu
Aurélie Pahud de Mortanges
Robert Berke
Mauricio Reyes
75
0
0
15 May 2025
Prioritizing Image-Related Tokens Enhances Vision-Language Pre-Training
Yiran Chen
Hao Peng
Tong Zhang
Heng Ji
VLM
88
0
0
13 May 2025
Probabilistic Embeddings for Frozen Vision-Language Models: Uncertainty Quantification with Gaussian Process Latent Variable Models
Aishwarya Venkataramanan
P. Bodesheim
Joachim Denzler
BDL
VLM
106
0
0
08 May 2025
OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning
Xianhang Li
Yixiao Liu
Haoqin Tu
Hongru Zhu
Cihang Xie
VLM
444
2
0
07 May 2025
Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities
Wei Wei
Jintao Guo
Shanshan Zhao
Minghao Fu
Lunhao Duan
...
Guo-Hua Wang
Qing-Guo Chen
Zhao Xu
Weihua Luo
Kaifu Zhang
DiffM
319
1
0
05 May 2025
Compositional Image-Text Matching and Retrieval by Grounding Entities
Madhukar Reddy Vongala
Saurabh Srivastava
Jana Kosecka
CLIP
CoGe
VLM
99
0
0
04 May 2025
A Survey of Robotic Navigation and Manipulation with Physics Simulators in the Era of Embodied AI
Lik Hang Kenny Wong
Xueyang Kang
Kaixin Bai
Jianwei Zhang
162
0
0
01 May 2025
Detecting and Mitigating Hateful Content in Multimodal Memes with Vision-Language Models
Minh-Hao Van
Xintao Wu
VLM
164
0
0
30 Apr 2025
DOPE: Dual Object Perception-Enhancement Network for Vision-and-Language Navigation
Yinfeng Yu
Dongsheng Yang
98
0
0
30 Apr 2025
Synergy-CLIP: Extending CLIP with Multi-modal Integration for Robust Representation Learning
Sangyeon Cho
Jangyeong Jeon
Mingi Kim
Junyeong Kim
CLIP
VLM
241
0
0
30 Apr 2025
Multimodal Large Language Models for Medicine: A Comprehensive Survey
Jiarui Ye
Hao Tang
LM&MA
191
0
0
29 Apr 2025
Enhancing Surgical Documentation through Multimodal Visual-Temporal Transformers and Generative AI
Hugo Georgenthum
Cristian Cosentino
Fabrizio Marozzo
Pietro Liò
MedIm
443
0
0
28 Apr 2025
A Survey of Task-Oriented Knowledge Graph Reasoning: Status, Applications, and Prospects
Guanglin Niu
Bo Li
Yangguang Lin
LRM
59
0
0
27 Apr 2025
ShapeSpeak: Body Shape-Aware Textual Alignment for Visible-Infrared Person Re-Identification
Shuanglin Yan
Neng Dong
Shuang Li
Rui Yan
Hao Tang
Jing Qin
439
0
0
25 Apr 2025
A Genealogy of Multi-Sensor Foundation Models in Remote Sensing
Kevin Lane
Morteza Karimzadeh
91
0
0
24 Apr 2025
Detecting and Understanding Hateful Contents in Memes Through Captioning and Visual Question-Answering
Ali Anaissi
Junaid Akram
Kunal Chaturvedi
Ali Braytee
60
0
0
23 Apr 2025
VLM as Policy: Common-Law Content Moderation Framework for Short Video Platform
Xingyu Lu
Tianke Zhang
Chang Meng
Xinyu Wang
Jinpeng Wang
...
Hai-Tao Zheng
Fan Yang
Yan Li
Di Zhang
Kun Gai
OffRL
91
0
0
21 Apr 2025
EmoSEM: Segment and Explain Emotion Stimuli in Visual Art
Jing Zhang
Dan Guo
Zhangbin Li
Meng Wang
89
0
0
20 Apr 2025
Hadamard product in deep learning: Introduction, Advances and Challenges
Grigorios G. Chrysos
Yongtao Wu
Razvan Pascanu
Philip Torr
Volkan Cevher
AAML
172
2
0
17 Apr 2025
DeepMLF: Multimodal language model with learnable tokens for deep fusion in sentiment analysis
Efthymios Georgiou
Vassilis Katsouros
Yannis Avrithis
Alexandros Potamianos
100
1
0
15 Apr 2025
FocalLens: Instruction Tuning Enables Zero-Shot Conditional Image Representations
Cheng-Yu Hsieh
Pavan Kumar Anasosalu Vasu
Fartash Faghri
Raviteja Vemulapalli
Chun-Liang Li
Ranjay Krishna
Oncel Tuzel
Hadi Pouransari
VLM
472
0
0
11 Apr 2025
TokenFocus-VQA: Enhancing Text-to-Image Alignment with Position-Aware Focus and Multi-Perspective Aggregations on LVLMs
Zijian Zhang
Xuhui Zheng
X. Wu
Chong Peng
Xuezhi Cao
73
2
0
10 Apr 2025
Zeus: Zero-shot LLM Instruction for Union Segmentation in Multimodal Medical Imaging
Siyuan Dai
Kai Ye
Guodong Liu
Haoteng Tang
Liang Zhan
MedIm
53
0
0
09 Apr 2025
1
2
3
4
...
41
42
43
Next