1908.06066
Cited By
v1
v2
v3 (latest)
Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training
16 August 2019
Gen Li
Nan Duan
Yuejian Fang
Ming Gong
Daxin Jiang
Ming Zhou
SSL
VLM
MLLM
Papers citing
"Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training"
50 / 512 papers shown
Title
Stepping Out of Similar Semantic Space for Open-Vocabulary Segmentation
Yong-Jin Liu
SongLi Wu
Sule Bai
Jiahao Wang
Yitong Wang
Yansong Tang
VLM
VOS
57
0
0
19 Jun 2025
Manager: Aggregating Insights from Unimodal Experts in Two-Tower VLMs and MLLMs
Xiao Xu
L. Qin
Wanxiang Che
Min-Yen Kan
MoE
VLM
34
0
0
13 Jun 2025
Generating Vision-Language Navigation Instructions Incorporated Fine-Grained Alignment Annotations
Yibo Cui
Liang Xie
Yu Zhao
Jiawei Sun
Erwei Yin
17
0
0
10 Jun 2025
Enhancing Surgical Documentation through Multimodal Visual-Temporal Transformers and Generative AI
Hugo Georgenthum
Cristian Cosentino
Fabrizio Marozzo
Pietro Liò
MedIm
443
0
0
28 Apr 2025
A Survey of Task-Oriented Knowledge Graph Reasoning: Status, Applications, and Prospects
Guanglin Niu
Bo Li
Yangguang Lin
LRM
54
0
0
27 Apr 2025
FocalLens: Instruction Tuning Enables Zero-Shot Conditional Image Representations
Cheng-Yu Hsieh
Pavan Kumar Anasosalu Vasu
Fartash Faghri
Raviteja Vemulapalli
Chun-Liang Li
Ranjay Krishna
Oncel Tuzel
Hadi Pouransari
VLM
472
0
0
11 Apr 2025
Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation
Ziming Wei
Bingqian Lin
Yunshuang Nie
Jiaqi Chen
Shikui Ma
Hang Xu
Xiaodan Liang
151
1
0
23 Mar 2025
Optimal Transport for Brain-Image Alignment: Unveiling Redundancy and Synergy in Neural Information Processing
Yang Xiao
Wang Lu
Jie Ji
Ruimeng Ye
Gen Li
Xiaolong Ma
Bo Hui
OT
97
0
0
09 Mar 2025
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks
Miran Heo
Min-Hung Chen
De-An Huang
Sifei Liu
Subhashree Radhakrishnan
Seon Joo Kim
Yu-Chun Wang
Ryo Hachiuma
ObjD
VLM
276
3
0
14 Jan 2025
VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks
Jiannan Wu
Muyan Zhong
Sen Xing
Zeqiang Lai
Zhaoyang Liu
...
Lewei Lu
Tong Lu
Ping Luo
Yu Qiao
Jifeng Dai
MLLM
VLM
LRM
363
59
0
03 Jan 2025
DreamMask: Boosting Open-vocabulary Panoptic Segmentation with Synthetic Data
Yuanpeng Tu
Xi Chen
Ser-Nam Lim
Hengshuang Zhao
188
1
0
03 Jan 2025
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
Wenqi Zhang
Hang Zhang
Xin Li
Jiashuo Sun
Yongliang Shen
Weiming Lu
Deli Zhao
Yueting Zhuang
Lidong Bing
VLM
173
2
0
01 Jan 2025
CorrCLIP: Reconstructing Correlations in CLIP with Off-the-Shelf Foundation Models for Open-Vocabulary Semantic Segmentation
Dengke Zhang
Fagui Liu
Quan Tang
VLM
157
2
0
15 Nov 2024
Aggregate-and-Adapt Natural Language Prompts for Downstream Generalization of CLIP
Chen Huang
Skyler Seto
Samira Abnar
David Grangier
Navdeep Jaitly
J. Susskind
VLM
77
1
0
31 Oct 2024
ViConsFormer: Constituting Meaningful Phrases of Scene Texts using Transformer-based Method in Vietnamese Text-based Visual Question Answering
Nghia Hieu Nguyen
Tho Thanh Quan
Ngan Luu-Thuy Nguyen
75
0
0
18 Oct 2024
CMAL: A Novel Cross-Modal Associative Learning Framework for Vision-Language Pre-Training
Zhiyuan Ma
Jianjun Li
Guohui Li
Kaiyan Huang
VLM
120
9
0
16 Oct 2024
Leveraging Customer Feedback for Multi-modal Insight Extraction
Sandeep Sricharan Mukku
Abinesh Kanagarajan
Pushpendu Ghosh
Chetan Aggarwal
29
0
0
13 Oct 2024
Advancing Medical Radiograph Representation Learning: A Hybrid Pre-training Paradigm with Multilevel Semantic Granularity
Hanqi Jiang
Xixuan Hao
Yuzhou Huang
Chong Ma
Jiaxun Zhang
Yi Pan
Ruimao Zhang
MedIm
175
0
0
01 Oct 2024
VidLPRO: A Video-Language Pre-training Framework for Robotic and Laparoscopic Surgery
Mohammadmahdi Honarmand
Muhammad Abdullah Jamal
Omid Mohareri
145
2
0
07 Sep 2024
A Survey on Integrated Sensing, Communication, and Computation
Dingzhu Wen
Yong Zhou
Xiaoyang Li
Yuanming Shi
Kaibin Huang
Khaled B. Letaief
74
33
0
15 Aug 2024
ReCLIP++: Learn to Rectify the Bias of CLIP for Unsupervised Semantic Segmentation
Jingyun Wang
Guoliang Kang
VLM
SSL
103
7
0
13 Aug 2024
Efficient and Versatile Robust Fine-Tuning of Zero-shot Models
Sungyeon Kim
Boseung Jeong
Donghyun Kim
Suha Kwak
VLM
88
3
0
11 Aug 2024
FlexAttention for Efficient High-Resolution Vision-Language Models
Junyan Li
Delin Chen
Tianle Cai
Peihao Chen
Yining Hong
Zhenfang Chen
Yikang Shen
Chuang Gan
VLM
125
5
0
29 Jul 2024
HAPFI: History-Aware Planning based on Fused Information
Sujin Jeon
Suyeon Shin
Byoung-Tak Zhang
58
0
0
23 Jul 2024
I Know About "Up"! Enhancing Spatial Reasoning in Visual Language Models Through 3D Reconstruction
Zaiqiao Meng
Hao Zhou
Yifang Chen
68
4
0
19 Jul 2024
Precision at Scale: Domain-Specific Datasets On-Demand
Jesús M. Rodríguez-de-Vera
Imanol G. Estepa
Ignacio Sarasúa
Bhalaji Nagarajan
Petia Radeva
87
2
0
03 Jul 2024
Cross-Modal Learning for Anomaly Detection in Fused Magnesium Smelting Process: Methodology and Benchmark
Gaochang Wu
Yapeng Zhang
Lan Deng
Jingxin Zhang
Tianyou Chai
56
7
0
13 Jun 2024
Labeling Comic Mischief Content in Online Videos with a Multimodal Hierarchical-Cross-Attention Model
Elaheh Baharlouei
Mahsa Shafaei
Yigeng Zhang
Hugo Jair Escalante
Thamar Solorio
81
0
0
12 Jun 2024
Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning
Chenyu Yang
Xizhou Zhu
Jinguo Zhu
Weijie Su
Junjie Wang
...
Lewei Lu
Bin Li
Jie Zhou
Yu Qiao
Jifeng Dai
VLM
CLIP
87
6
0
11 Jun 2024
One Perturbation is Enough: On Generating Universal Adversarial Perturbations against Vision-Language Pre-training Models
Hao Fang
Jiawei Kong
Wenbo Yu
Bin Chen
Jiawei Li
Hao Wu
Ke Xu
AAML
VLM
133
14
0
08 Jun 2024
Hire: Hybrid-modal Interaction with Multiple Relational Enhancements for Image-Text Matching
Xuri Ge
Fuhai Chen
Songpei Xu
Fuxiang Tao
Jie Wang
Joemon M. Jose
65
1
0
05 Jun 2024
Synergy and Diversity in CLIP: Enhancing Performance Through Adaptive Backbone Ensembling
Cristian Rodriguez-Opazo
Ehsan Abbasnejad
Damien Teney
Edison Marrese-Taylor
Hamed Damirchi
Anton Van Den Hengel
VLM
136
1
0
27 May 2024
ColorFoil: Investigating Color Blindness in Large Vision and Language Models
Ahnaf Mozib Samin
M. F. Ahmed
Md. Mushtaq Shahriyar Rafee
VLM
119
3
0
19 May 2024
SignAvatar: Sign Language 3D Motion Reconstruction and Generation
Lu Dong
Lipisha Chaudhary
Fei Xu
Xiao Wang
Mason Lary
Ifeoma Nwogu
SLR
56
4
0
13 May 2024
3SHNet: Boosting Image-Sentence Retrieval via Visual Semantic-Spatial Self-Highlighting
Xuri Ge
Songpei Xu
Fuhai Chen
Jie Wang
Guoxin Wang
Shan An
Joemon M. Jose
3DPC
110
12
0
26 Apr 2024
SHE-Net: Syntax-Hierarchy-Enhanced Text-Video Retrieval
Xuzheng Yu
Chen Jiang
Xingning Dong
Tian Gan
Ming Yang
Qingpei Guo
117
2
0
22 Apr 2024
From Data Deluge to Data Curation: A Filtering-WoRA Paradigm for Efficient Text-based Person Search
Jintao Sun
Zhedong Zheng
Gangyi Ding
124
8
0
16 Apr 2024
ViTextVQA: A Large-Scale Visual Question Answering Dataset for Evaluating Vietnamese Text Comprehension in Images
Quan Van Nguyen
Dan Quang Tran
Huy Quang Pham
Thang Kien-Bao Nguyen
Nghia Hieu Nguyen
Kiet Van Nguyen
Ngan Luu-Thuy Nguyen
CoGe
172
5
0
16 Apr 2024
Transferable and Principled Efficiency for Open-Vocabulary Segmentation
Jingxuan Xu
Wuyang Chen
Yao-Min Zhao
Yunchao Wei
VLM
100
2
0
11 Apr 2024
Hyperbolic Learning with Synthetic Captions for Open-World Detection
Fanjie Kong
Yanbei Chen
Jiarui Cai
Davide Modolo
VLM
ObjD
67
7
0
07 Apr 2024
Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want
Weifeng Lin
Xinyu Wei
Ruichuan An
Peng Gao
Bocheng Zou
Yulin Luo
Siyuan Huang
Shanghang Zhang
Hongsheng Li
VLM
184
47
0
29 Mar 2024
UrbanVLP: Multi-Granularity Vision-Language Pretraining for Urban Socioeconomic Indicator Prediction
Xixuan Hao
Wei Chen
Yibo Yan
Siru Zhong
Kun Wang
Qingsong Wen
Yuxuan Liang
VLM
118
1
0
25 Mar 2024
VidLA: Video-Language Alignment at Scale
Mamshad Nayeem Rizve
Fan Fei
Jayakrishnan Unnikrishnan
Son Tran
Benjamin Z. Yao
Belinda Zeng
Mubarak Shah
Trishul Chilimbi
VLM
AI4TS
92
4
0
21 Mar 2024
A Comprehensive Survey of 3D Dense Captioning: Localizing and Describing Objects in 3D Scenes
Ting Yu
Xiaojun Lin
Shuhui Wang
Weiguo Sheng
Qingming Huang
Jun-chen Yu
3DV
88
10
0
12 Mar 2024
Towards Deviation-Robust Agent Navigation via Perturbation-Aware Contrastive Learning
Bingqian Lin
Yanxin Long
Yi Zhu
Fengda Zhu
Xiaodan Liang
QiXiang Ye
Liang Lin
84
5
0
09 Mar 2024
The All-Seeing Project V2: Towards General Relation Comprehension of the Open World
Weiyun Wang
Yiming Ren
Hao Luo
Tiantong Li
Chenxiang Yan
...
Qingyun Li
Lewei Lu
Xizhou Zhu
Yu Qiao
Jifeng Dai
MLLM
143
53
0
29 Feb 2024
Automatic Creative Selection with Cross-Modal Matching
Alex Kim
Jia Huang
Rob Monarch
Jerry Kwac
Anikesh Kamath
P. Khurd
Kailash Thiyagarajan
Goodman Gu
VLM
40
0
0
28 Feb 2024
Acquiring Linguistic Knowledge from Multimodal Input
Theodor Amariucai
Alexander Scott Warstadt
CLL
91
2
0
27 Feb 2024
Demonstrating and Reducing Shortcuts in Vision-Language Representation Learning
Maurits J. R. Bleeker
Mariya Hendriksen
Andrew Yates
Maarten de Rijke
VLM
97
2
0
27 Feb 2024
CFIR: Fast and Effective Long-Text To Image Retrieval for Large Corpora
Zijun Long
Xuri Ge
R. McCreadie
Joemon M. Jose
75
7
0
23 Feb 2024