Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
1908.06066
Cited By
v1
v2
v3 (latest)
Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training
16 August 2019
Gen Li
Nan Duan
Yuejian Fang
Ming Gong
Daxin Jiang
Ming Zhou
SSL
VLM
MLLM
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training"
50 / 512 papers shown
Title
Exploring Missing Modality in Multimodal Egocentric Datasets
Merey Ramazanova
Alejandro Pardo
Humam Alwassel
Guohao Li
EgoV
71
4
0
21 Jan 2024
POP-3D: Open-Vocabulary 3D Occupancy Prediction from Images
Antonín Vobecký
Oriane Siméoni
David Hurych
Spyros Gidaris
Andrei Bursuc
Patrick Pérez
Josef Sivic
111
35
0
17 Jan 2024
CrisisKAN: Knowledge-infused and Explainable Multimodal Attention Network for Crisis Event Classification
Shubham Gupta
Nandini Saini
Suman Kundu
Debasis Das
61
7
0
11 Jan 2024
Let's Go Shopping (LGS) -- Web-Scale Image-Text Dataset for Visual Concept Understanding
Yatong Bai
Utsav Garg
Apaar Shanker
Haoming Zhang
Samyak Parajuli
...
Eugenia D Fomitcheva
E. Branson
Aerin Kim
Somayeh Sojoudi
Kyunghyun Cho
56
2
0
09 Jan 2024
FM-AE: Frequency-masked Multimodal Autoencoder for Zinc Electrolysis Plate Contact Abnormality Detection
Can Zhou
Can Zhou
Hongqiu Zhu
Tianhao Liu
26
8
0
08 Jan 2024
Unveiling Backbone Effects in CLIP: Exploring Representational Synergies and Variances
Cristian Rodriguez-Opazo
Edison Marrese-Taylor
Ehsan Abbasnejad
Hamed Damirchi
Ignacio M. Jara
Felipe Bravo-Marquez
Anton Van Den Hengel
VLM
66
1
0
22 Dec 2023
Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model
Shraman Pramanick
Guangxing Han
Rui Hou
Sayan Nag
Ser-Nam Lim
Nicolas Ballas
Qifan Wang
Rama Chellappa
Amjad Almahairi
VLM
MLLM
167
36
0
19 Dec 2023
A Foundational Multimodal Vision Language AI Assistant for Human Pathology
Ming Y. Lu
Bowen Chen
Drew F. K. Williamson
Richard J. Chen
Kenji Ikamura
...
Ivy Liang
L. Le
Tong Ding
Anil V. Parwani
Faisal Mahmood
MedIm
LM&MA
86
23
0
13 Dec 2023
Open-Vocabulary Segmentation with Semantic-Assisted Calibration
Yong Liu
Sule Bai
Guanbin Li
Yitong Wang
Yansong Tang
VLM
97
33
0
07 Dec 2023
Bootstrapping Interactive Image-Text Alignment for Remote Sensing Image Captioning
Cong Yang
Zuchao Li
Lefei Zhang
72
26
0
02 Dec 2023
ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
Mu Cai
Haotian Liu
Dennis Park
Siva Karthik Mustikovela
Gregory P. Meyer
Yuning Chai
Yong Jae Lee
VLM
LRM
MLLM
126
99
0
01 Dec 2023
A Video is Worth 10,000 Words: Training and Benchmarking with Diverse Captions for Better Long Video Retrieval
M. Gwilliam
Michael Cogswell
Meng Ye
Karan Sikka
Abhinav Shrivastava
Ajay Divakaran
3DV
93
1
1
30 Nov 2023
LEAP: LLM-Generation of Egocentric Action Programs
Eadom Dessalene
Michael Maynord
Cornelia Fermuller
Yiannis Aloimonos
80
2
0
29 Nov 2023
Contrastive Vision-Language Alignment Makes Efficient Instruction Learner
Lizhao Liu
Xinyu Sun
Tianhang Xiang
Zhuangwei Zhuang
Liuren Yin
Mingkui Tan
VLM
60
3
0
29 Nov 2023
E-ViLM: Efficient Video-Language Model via Masked Video Modeling with Semantic Vector-Quantized Tokenizer
Jacob Zhiyuan Fang
Skyler Zheng
Vasu Sharma
Robinson Piramuthu
VLM
154
0
0
28 Nov 2023
LANS: A Layout-Aware Neural Solver for Plane Geometry Problem
Zhong-Zhi Li
Ming-Liang Zhang
Fei Yin
Cheng-Lin Liu
82
17
0
25 Nov 2023
ViStruct: Visual Structural Knowledge Extraction via Curriculum Guided Code-Vision Representation
Yangyi Chen
Xingyao Wang
Manling Li
Derek Hoiem
Heng Ji
81
12
0
22 Nov 2023
BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive Learning
Siyuan Liang
Mingli Zhu
Aishan Liu
Baoyuan Wu
Xiaochun Cao
Ee-Chien Chang
132
58
0
20 Nov 2023
Open-Vocabulary Camouflaged Object Segmentation
Youwei Pang
Xiaoqi Zhao
Jiaming Zuo
Lihe Zhang
Huchuan Lu
VLM
ObjD
100
7
0
19 Nov 2023
Active Prompt Learning in Vision Language Models
Jihwan Bang
Sumyeong Ahn
Jae-Gil Lee
VLM
66
14
0
18 Nov 2023
DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback
Yangyi Chen
Karan Sikka
Michael Cogswell
Heng Ji
Ajay Divakaran
131
72
0
16 Nov 2023
Teach me with a Whisper: Enhancing Large Language Models for Analyzing Spoken Transcripts using Speech Embeddings
Fatema Hasan
Yulong Li
James R. Foulds
Shimei Pan
Bishwaranjan Bhattacharjee
79
2
0
13 Nov 2023
Improving Vision-and-Language Reasoning via Spatial Relations Modeling
Cheng Yang
Rui Xu
Ye Guo
Peixiang Huang
Yiru Chen
Wenkui Ding
Zhongyuan Wang
Hong Zhou
LRM
59
6
0
09 Nov 2023
Lost Your Style? Navigating with Semantic-Level Approach for Text-to-Outfit Retrieval
Junkyu Jang
Eugene Hwang
Sung-Hyuk Park
49
0
0
03 Nov 2023
From Image to Language: A Critical Analysis of Visual Question Answering (VQA) Approaches, Challenges, and Opportunities
Md Farhan Ishmam
Md Sakib Hossain Shovon
M. F. Mridha
Nilanjan Dey
151
44
0
01 Nov 2023
M2C: Towards Automatic Multimodal Manga Complement
Hongcheng Guo
Boyang Wang
Jiaqi Bai
Jiaheng Liu
Jian Yang
Zhoujun Li
94
10
0
26 Oct 2023
The BLA Benchmark: Investigating Basic Language Abilities of Pre-Trained Multimodal Models
Xinyi Chen
Raquel Fernández
Sandro Pezzelle
VLM
62
10
0
23 Oct 2023
Jaeger: A Concatenation-Based Multi-Transformer VQA Model
Jieting Long
Zewei Shi
Penghao Jiang
Yidong Gan
53
0
0
11 Oct 2023
I2SRM: Intra- and Inter-Sample Relationship Modeling for Multimodal Information Extraction
Yusheng Huang
Zhouhan Lin
57
5
0
10 Oct 2023
GRID: A Platform for General Robot Intelligence Development
Sai H. Vemprala
Shuhang Chen
Abhinav Shukla
Dinesh Narayanan
Ashish Kapoor
91
10
0
02 Oct 2023
AutomaTikZ: Text-Guided Synthesis of Scientific Vector Graphics with TikZ
Jonas Belouadi
Anne Lauscher
Steffen Eger
75
31
0
30 Sep 2023
Align before Search: Aligning Ads Image to Text for Accurate Cross-Modal Sponsored Search
Yuanmin Tang
Daling Wang
Keke Gai
Wenfang Wu
Yifei Zhang
Gang Xiong
Qi Wu
73
4
0
28 Sep 2023
Context-I2W: Mapping Images to Context-dependent Words for Accurate Zero-Shot Composed Image Retrieval
Yuanmin Tang
Jiahao Yu
Keke Gai
Jiamin Zhuang
Gang Xiong
Yue Hu
Qi Wu
83
39
0
28 Sep 2023
Tile Classification Based Viewport Prediction with Multi-modal Fusion Transformer
Zhihao Zhang
Yiwei Chen
Weizhan Zhang
Caixia Yan
Qinghua Zheng
Qi Wang
Wang Chen
43
6
0
26 Sep 2023
VidChapters-7M: Video Chapters at Scale
Antoine Yang
Arsha Nagrani
Ivan Laptev
Josef Sivic
Cordelia Schmid
VGen
98
28
0
25 Sep 2023
A Survey on Image-text Multimodal Models
Ruifeng Guo
Jingxuan Wei
Linzhuang Sun
Khai-Nguyen Nguyen
Guiyong Chang
Dawei Liu
Sibo Zhang
Zhengbing Yao
Mingjun Xu
Liping Bu
VLM
128
7
0
23 Sep 2023
In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval
Nina Shvetsova
Anna Kukleva
Bernt Schiele
Hilde Kuehne
DiffM
77
4
0
16 Sep 2023
Improving Multimodal Classification of Social Media Posts by Leveraging Image-Text Auxiliary Tasks
Danae Sánchez Villegas
Daniel Preoctiuc-Pietro
Nikolaos Aletras
64
3
0
14 Sep 2023
Beyond Generation: Harnessing Text to Image Models for Object Detection and Segmentation
Yunhao Ge
Lyne Tchapmi
Brian Nlong Zhao
Neel Joshi
Laurent Itti
Vibhav Vineet
DiffM
77
14
0
12 Sep 2023
Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models
Yangyi Chen
Karan Sikka
Michael Cogswell
Heng Ji
Ajay Divakaran
LRM
99
27
0
08 Sep 2023
Unified Pre-training with Pseudo Texts for Text-To-Image Person Re-identification
Zhiyin Shao
Xinyu Zhang
Changxing Ding
Jian Wang
Jingdong Wang
95
19
0
04 Sep 2023
A Fine-Grained Image Description Generation Method Based on Joint Objectives
Yifan Zhang
Chunzhen Lin
Donglin Cao
Dazhen Lin
EGVM
37
0
0
02 Sep 2023
Catalog Phrase Grounding (CPG): Grounding of Product Textual Attributes in Product Images for e-commerce Vision-Language Applications
Wenyi Wu
Karim Bouyarmane
Ismail B. Tutar
33
2
0
30 Aug 2023
Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object Detection
Yifan Xu
Mengdan Zhang
Xiaoshan Yang
Changsheng Xu
ObjD
80
5
0
30 Aug 2023
Multi-event Video-Text Retrieval
Gengyuan Zhang
Jisen Ren
Jindong Gu
Volker Tresp
85
14
0
22 Aug 2023
Language-guided Human Motion Synthesis with Atomic Actions
Yuanhao Zhai
Mingzhen Huang
Tianyu Luan
Lu Dong
Ifeoma Nwogu
Siwei Lyu
David Doermann
Junsong Yuan
79
13
0
18 Aug 2023
Artificial-Spiking Hierarchical Networks for Vision-Language Representation Learning
Ye-Ting Chen
Siyu Zhang
Yaoru Sun
Weijian Liang
Haoran Wang
74
1
0
18 Aug 2023
Beyond Generic: Enhancing Image Captioning with Real-World Knowledge using Vision-Language Pre-Training Model
Ka Leong Cheng
Wenpo Song
Zheng Ma
Wenhao Zhu
Zi-Yue Zhu
Jianbing Zhang
CLIP
VLM
65
11
0
02 Aug 2023
Robust Visual Question Answering: Datasets, Methods, and Future Challenges
Jie Ma
Pinghui Wang
Dechen Kong
Zewei Wang
Jun Liu
Hongbin Pei
Junzhou Zhao
OOD
126
23
0
21 Jul 2023
PAT: Parallel Attention Transformer for Visual Question Answering in Vietnamese
Nghia Hieu Nguyen
Kiet Van Nguyen
47
2
0
17 Jul 2023
Previous
1
2
3
4
5
...
9
10
11
Next