Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
1803.08024
Cited By
Stacked Cross Attention for Image-Text Matching
21 March 2018
Kuang-Huei Lee
Xi Chen
G. Hua
Houdong Hu
Xiaodong He
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Stacked Cross Attention for Image-Text Matching"
50 / 161 papers shown
Title
CLIP-Driven Fine-grained Text-Image Person Re-identification
Shuanglin Yan
Neng Dong
Liyan Zhang
Jinhui Tang
39
87
0
19 Oct 2022
Cross-modal Semantic Enhanced Interaction for Image-Sentence Retrieval
Xuri Ge
Fuhai Chen
Songpei Xu
Fuxiang Tao
J. Jose
30
26
0
17 Oct 2022
Multi-Granularity Cross-modal Alignment for Generalized Medical Visual Representation Learning
Fuying Wang
Yuyin Zhou
Shujun Wang
V. Vardhanabhuti
Lequan Yu
26
137
0
12 Oct 2022
Transformer-based Localization from Embodied Dialog with Large-scale Pre-training
Meera Hahn
James M. Rehg
LM&Ro
40
4
0
10 Oct 2022
Improving Visual-Semantic Embeddings by Learning Semantically-Enhanced Hard Negatives for Cross-modal Information Retrieval
Yan Gong
Georgina Cosma
27
11
0
10 Oct 2022
Unified Loss of Pair Similarity Optimization for Vision-Language Retrieval
Zheng Li
Caili Guo
Xin Wang
Zerun Feng
Jenq-Neng Hwang
Zhongtian Du
VLM
24
2
0
28 Sep 2022
TokenFlow: Rethinking Fine-grained Cross-modal Alignment in Vision-Language Retrieval
Xiaohan Zou
Changqiao Wu
Lele Cheng
Zhongyuan Wang
94
6
0
28 Sep 2022
ERNIE-mmLayout: Multi-grained MultiModal Transformer for Document Understanding
Wenjin Wang
Zhengjie Huang
Bin Luo
Qianglong Chen
Qiming Peng
...
Weichong Yin
Shi Feng
Yu Sun
Dianhai Yu
Yin Zhang
ViT
30
11
0
18 Sep 2022
CODER: Coupled Diversity-Sensitive Momentum Contrastive Learning for Image-Text Retrieval
Haoran Wang
Dongliang He
Wenhao Wu
Boyang Xia
Min Yang
Fu Li
Yunlong Yu
Zhong Ji
Errui Ding
Jingdong Wang
30
22
0
21 Aug 2022
Vision-Language Matching for Text-to-Image Synthesis via Generative Adversarial Networks
Qingrong Cheng
Keyu Wen
X. Gu
VLM
EGVM
32
16
0
20 Aug 2022
See Finer, See More: Implicit Modality Alignment for Text-based Person Retrieval
Xiujun Shu
Wei Wen
Haoqian Wu
Keyun Chen
Yi-Zhe Song
Ruizhi Qiao
Bohan Ren
Xiao Wang
24
91
0
18 Aug 2022
CORNET: Learning Table Formatting Rules By Example
Mukul Singh
J. Cambronero
Sumit Gulwani
Vu Le
Carina Negreanu
Mohammad Raza
Gust Verbruggen
LMTD
56
8
0
11 Aug 2022
A Sketch Is Worth a Thousand Words: Image Retrieval with Text and Sketch
Patsorn Sangkloy
Wittawat Jitkrittum
Diyi Yang
James Hays
3DV
23
32
0
05 Aug 2022
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval
Yiwei Ma
Guohai Xu
Xiaoshuai Sun
Ming Yan
Ji Zhang
Rongrong Ji
CLIP
VLM
25
269
0
15 Jul 2022
Automatic Generation of Product-Image Sequence in E-commerce
Xiaochuan Fan
Chi Zhang
Yong-Jie Yang
Yue Shang
Xueying Zhang
Zhen He
Yun Xiao
Bo Long
Lingfei Wu
23
4
0
26 Jun 2022
VD-PCR: Improving Visual Dialog with Pronoun Coreference Resolution
Xintong Yu
Hongming Zhang
Ruixin Hong
Yangqiu Song
Changshui Zhang
17
13
0
29 May 2022
Mutual Information Divergence: A Unified Metric for Multimodal Generative Models
Jin-Hwa Kim
Yunji Kim
Jiyoung Lee
Kang Min Yoo
Sang-Woo Lee
EGVM
36
32
0
25 May 2022
HiVLP: Hierarchical Vision-Language Pre-Training for Fast Image-Text Retrieval
Feilong Chen
Xiuyi Chen
Jiaxin Shi
Duzhen Zhang
Jianlong Chang
Qi Tian
VLM
CLIP
34
6
0
24 May 2022
Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning
Li Yang
Yan Xu
Chunfen Yuan
Wei Liu
Bing Li
Weiming Hu
ObjD
50
113
0
30 Apr 2022
An Overview of Recent Work in Media Forensics: Methods and Threats
Kratika Bhagtani
A. Yadav
Emily R. Bartusiak
Ziyue Xiang
Ruiting Shao
Sriram Baireddy
Edward J. Delp
AAML
47
25
0
26 Apr 2022
Exploring a Fine-Grained Multiscale Method for Cross-Modal Remote Sensing Image Retrieval
Zhiqiang Yuan
Wenkai Zhang
Kun Fu
Xuan Li
Chubo Deng
Hongqi Wang
Xian Sun
29
129
0
21 Apr 2022
On Distinctive Image Captioning via Comparing and Reweighting
Jiuniu Wang
Wenjia Xu
Qingzhong Wang
Antoni B. Chan
38
16
0
08 Apr 2022
Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation
Wangbo Zhao
Kai Wang
Xiangxiang Chu
Fuzhao Xue
Xinchao Wang
Yang You
29
21
0
06 Apr 2022
ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval
Mengjun Cheng
Yipeng Sun
Long Wang
Xiongwei Zhu
Kun Yao
...
Guoli Song
Junyu Han
Jingtuo Liu
Errui Ding
Jingdong Wang
24
60
0
31 Mar 2022
Parameter-efficient Model Adaptation for Vision Transformers
Xuehai He
Chunyuan Li
Pengchuan Zhang
Jianwei Yang
Qing Guo
30
84
0
29 Mar 2022
Sketching without Worrying: Noise-Tolerant Sketch-Based Image Retrieval
A. Bhunia
Subhadeep Koley
Abdullah Faiz Ur Rahman Khilji
Aneeshan Sain
Pinaki Nath Chowdhury
Tao Xiang
Yi-Zhe Song
AAML
22
42
0
28 Mar 2022
Two-stream Hierarchical Similarity Reasoning for Image-text Matching
Ran Chen
Hanli Wang
Lei Wang
Sam Kwong
13
9
0
10 Mar 2022
Where Does the Performance Improvement Come From? -- A Reproducibility Concern about Image-Text Retrieval
Jun Rao
Fei-Yue Wang
Liang Ding
Shuhan Qi
Yibing Zhan
Weifeng Liu
Dacheng Tao
OOD
42
28
0
08 Mar 2022
A Survey of Vision-Language Pre-Trained Models
Yifan Du
Zikang Liu
Junyi Li
Wayne Xin Zhao
VLM
33
179
0
18 Feb 2022
When Did It Happen? Duration-informed Temporal Localization of Narrated Actions in Vlogs
Oana Ignat
Santiago Castro
Yuhang Zhou
Jiajun Bao
Dandan Shan
Rada Mihalcea
18
3
0
16 Feb 2022
Do Lessons from Metric Learning Generalize to Image-Caption Retrieval?
Maurits J. R. Bleeker
Maarten de Rijke
SSL
DML
29
9
0
14 Feb 2022
Multi-Modal Knowledge Graph Construction and Application: A Survey
Xiangru Zhu
Zhixu Li
Xiaodan Wang
Xueyao Jiang
Penglei Sun
Xuwu Wang
Yanghua Xiao
N. Yuan
28
154
0
11 Feb 2022
Multimodal data matters: language model pre-training over structured and unstructured electronic health records
Sicen Liu
Xiaolong Wang
Yongshuai Hou
Ge Li
Hui Wang
Huiqin Xu
Yang Xiang
Buzhou Tang
52
30
0
25 Jan 2022
Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training
Yehao Li
Jiahao Fan
Yingwei Pan
Ting Yao
Weiyao Lin
Tao Mei
MLLM
ObjD
33
19
0
11 Jan 2022
VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts
Longtian Qiu
Renrui Zhang
Ziyu Guo
Wei Zhang
Zilu Guo
Ziyao Zeng
Guangnan Zhang
VLM
CLIP
28
45
0
04 Dec 2021
Object-aware Video-language Pre-training for Retrieval
Alex Jinpeng Wang
Yixiao Ge
Guanyu Cai
Rui Yan
Xudong Lin
Ying Shan
Xiaohu Qie
Mike Zheng Shou
ViT
VLM
17
79
0
01 Dec 2021
EMScore: Evaluating Video Captioning via Coarse-Grained and Fine-Grained Embedding Matching
Yaya Shi
Xu Yang
Haiyang Xu
Chunfen Yuan
Bing Li
Weiming Hu
Zhengjun Zha
39
33
0
17 Nov 2021
FILIP: Fine-grained Interactive Language-Image Pre-Training
Lewei Yao
Runhu Huang
Lu Hou
Guansong Lu
Minzhe Niu
Hang Xu
Xiaodan Liang
Zhenguo Li
Xin Jiang
Chunjing Xu
VLM
CLIP
30
615
0
09 Nov 2021
Instance-Conditional Knowledge Distillation for Object Detection
Zijian Kang
Peizhen Zhang
Xinming Zhang
Jian Sun
N. Zheng
24
76
0
25 Oct 2021
Multimodal Dialogue Response Generation
Qingfeng Sun
Yujing Wang
Can Xu
Kai Zheng
Yaming Yang
Huang Hu
Fei Xu
Jessica Zhang
Xiubo Geng
Daxin Jiang
23
43
0
16 Oct 2021
Topic Scene Graph Generation by Attention Distillation from Caption
Wenbin Wang
R. Wang
X. Chen
DiffM
19
14
0
12 Oct 2021
An animated picture says at least a thousand words: Selecting Gif-based Replies in Multimodal Dialog
Xingyao Wang
David Jurgens
22
5
0
24 Sep 2021
Dense Contrastive Visual-Linguistic Pretraining
Lei Shi
Kai Shuang
Shijie Geng
Peng Gao
Zuohui Fu
Gerard de Melo
Yunpeng Chen
Sen Su
VLM
SSL
54
10
0
24 Sep 2021
Survey: Transformer based Video-Language Pre-training
Ludan Ruan
Qin Jin
VLM
ViT
72
44
0
21 Sep 2021
Cross Modification Attention Based Deliberation Model for Image Captioning
Zheng Lian
Yanan Zhang
Haichang Li
Rui Wang
Xiaohui Hu
24
4
0
17 Sep 2021
DSSL: Deep Surroundings-person Separation Learning for Text-based Person Retrieval
A. Zhu
Zijie Wang
Yifeng Li
Xili Wan
Jing Jin
Tian Wang
Fangqiang Hu
G. Hua
95
162
0
12 Sep 2021
Are Gender-Neutral Queries Really Gender-Neutral? Mitigating Gender Bias in Image Search
Jialu Wang
Yang Liu
Qing Guo
FaML
157
95
0
12 Sep 2021
TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment
Jianwei Yang
Yonatan Bisk
Jianfeng Gao
27
137
0
23 Aug 2021
ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration
Yuhao Cui
Zhou Yu
Chunqi Wang
Zhongzhou Zhao
Ji Zhang
Meng Wang
Jun-chen Yu
VLM
27
53
0
16 Aug 2021
ICAF: Iterative Contrastive Alignment Framework for Multimodal Abstractive Summarization
Zijian Zhang
Chang Shu
Youxin Chen
Jing Xiao
Qian Zhang
Lu Zheng
23
5
0
11 Aug 2021
Previous
1
2
3
4
Next