ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 1803.08024
  4. Cited By
Stacked Cross Attention for Image-Text Matching

Stacked Cross Attention for Image-Text Matching

21 March 2018
Kuang-Huei Lee
Xi Chen
G. Hua
Houdong Hu
Xiaodong He
ArXivPDFHTML

Papers citing "Stacked Cross Attention for Image-Text Matching"

50 / 161 papers shown
Title
CLIP-Driven Fine-grained Text-Image Person Re-identification
CLIP-Driven Fine-grained Text-Image Person Re-identification
Shuanglin Yan
Neng Dong
Liyan Zhang
Jinhui Tang
39
87
0
19 Oct 2022
Cross-modal Semantic Enhanced Interaction for Image-Sentence Retrieval
Cross-modal Semantic Enhanced Interaction for Image-Sentence Retrieval
Xuri Ge
Fuhai Chen
Songpei Xu
Fuxiang Tao
J. Jose
30
26
0
17 Oct 2022
Multi-Granularity Cross-modal Alignment for Generalized Medical Visual
  Representation Learning
Multi-Granularity Cross-modal Alignment for Generalized Medical Visual Representation Learning
Fuying Wang
Yuyin Zhou
Shujun Wang
V. Vardhanabhuti
Lequan Yu
26
137
0
12 Oct 2022
Transformer-based Localization from Embodied Dialog with Large-scale
  Pre-training
Transformer-based Localization from Embodied Dialog with Large-scale Pre-training
Meera Hahn
James M. Rehg
LM&Ro
40
4
0
10 Oct 2022
Improving Visual-Semantic Embeddings by Learning Semantically-Enhanced
  Hard Negatives for Cross-modal Information Retrieval
Improving Visual-Semantic Embeddings by Learning Semantically-Enhanced Hard Negatives for Cross-modal Information Retrieval
Yan Gong
Georgina Cosma
27
11
0
10 Oct 2022
Unified Loss of Pair Similarity Optimization for Vision-Language
  Retrieval
Unified Loss of Pair Similarity Optimization for Vision-Language Retrieval
Zheng Li
Caili Guo
Xin Wang
Zerun Feng
Jenq-Neng Hwang
Zhongtian Du
VLM
24
2
0
28 Sep 2022
TokenFlow: Rethinking Fine-grained Cross-modal Alignment in
  Vision-Language Retrieval
TokenFlow: Rethinking Fine-grained Cross-modal Alignment in Vision-Language Retrieval
Xiaohan Zou
Changqiao Wu
Lele Cheng
Zhongyuan Wang
94
6
0
28 Sep 2022
ERNIE-mmLayout: Multi-grained MultiModal Transformer for Document
  Understanding
ERNIE-mmLayout: Multi-grained MultiModal Transformer for Document Understanding
Wenjin Wang
Zhengjie Huang
Bin Luo
Qianglong Chen
Qiming Peng
...
Weichong Yin
Shi Feng
Yu Sun
Dianhai Yu
Yin Zhang
ViT
30
11
0
18 Sep 2022
CODER: Coupled Diversity-Sensitive Momentum Contrastive Learning for
  Image-Text Retrieval
CODER: Coupled Diversity-Sensitive Momentum Contrastive Learning for Image-Text Retrieval
Haoran Wang
Dongliang He
Wenhao Wu
Boyang Xia
Min Yang
Fu Li
Yunlong Yu
Zhong Ji
Errui Ding
Jingdong Wang
30
22
0
21 Aug 2022
Vision-Language Matching for Text-to-Image Synthesis via Generative
  Adversarial Networks
Vision-Language Matching for Text-to-Image Synthesis via Generative Adversarial Networks
Qingrong Cheng
Keyu Wen
X. Gu
VLM
EGVM
32
16
0
20 Aug 2022
See Finer, See More: Implicit Modality Alignment for Text-based Person
  Retrieval
See Finer, See More: Implicit Modality Alignment for Text-based Person Retrieval
Xiujun Shu
Wei Wen
Haoqian Wu
Keyun Chen
Yi-Zhe Song
Ruizhi Qiao
Bohan Ren
Xiao Wang
24
91
0
18 Aug 2022
CORNET: Learning Table Formatting Rules By Example
CORNET: Learning Table Formatting Rules By Example
Mukul Singh
J. Cambronero
Sumit Gulwani
Vu Le
Carina Negreanu
Mohammad Raza
Gust Verbruggen
LMTD
56
8
0
11 Aug 2022
A Sketch Is Worth a Thousand Words: Image Retrieval with Text and Sketch
A Sketch Is Worth a Thousand Words: Image Retrieval with Text and Sketch
Patsorn Sangkloy
Wittawat Jitkrittum
Diyi Yang
James Hays
3DV
23
32
0
05 Aug 2022
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text
  Retrieval
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval
Yiwei Ma
Guohai Xu
Xiaoshuai Sun
Ming Yan
Ji Zhang
Rongrong Ji
CLIP
VLM
25
269
0
15 Jul 2022
Automatic Generation of Product-Image Sequence in E-commerce
Automatic Generation of Product-Image Sequence in E-commerce
Xiaochuan Fan
Chi Zhang
Yong-Jie Yang
Yue Shang
Xueying Zhang
Zhen He
Yun Xiao
Bo Long
Lingfei Wu
23
4
0
26 Jun 2022
VD-PCR: Improving Visual Dialog with Pronoun Coreference Resolution
VD-PCR: Improving Visual Dialog with Pronoun Coreference Resolution
Xintong Yu
Hongming Zhang
Ruixin Hong
Yangqiu Song
Changshui Zhang
17
13
0
29 May 2022
Mutual Information Divergence: A Unified Metric for Multimodal
  Generative Models
Mutual Information Divergence: A Unified Metric for Multimodal Generative Models
Jin-Hwa Kim
Yunji Kim
Jiyoung Lee
Kang Min Yoo
Sang-Woo Lee
EGVM
36
32
0
25 May 2022
HiVLP: Hierarchical Vision-Language Pre-Training for Fast Image-Text
  Retrieval
HiVLP: Hierarchical Vision-Language Pre-Training for Fast Image-Text Retrieval
Feilong Chen
Xiuyi Chen
Jiaxin Shi
Duzhen Zhang
Jianlong Chang
Qi Tian
VLM
CLIP
34
6
0
24 May 2022
Improving Visual Grounding with Visual-Linguistic Verification and
  Iterative Reasoning
Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning
Li Yang
Yan Xu
Chunfen Yuan
Wei Liu
Bing Li
Weiming Hu
ObjD
50
113
0
30 Apr 2022
An Overview of Recent Work in Media Forensics: Methods and Threats
An Overview of Recent Work in Media Forensics: Methods and Threats
Kratika Bhagtani
A. Yadav
Emily R. Bartusiak
Ziyue Xiang
Ruiting Shao
Sriram Baireddy
Edward J. Delp
AAML
47
25
0
26 Apr 2022
Exploring a Fine-Grained Multiscale Method for Cross-Modal Remote
  Sensing Image Retrieval
Exploring a Fine-Grained Multiscale Method for Cross-Modal Remote Sensing Image Retrieval
Zhiqiang Yuan
Wenkai Zhang
Kun Fu
Xuan Li
Chubo Deng
Hongqi Wang
Xian Sun
29
129
0
21 Apr 2022
On Distinctive Image Captioning via Comparing and Reweighting
On Distinctive Image Captioning via Comparing and Reweighting
Jiuniu Wang
Wenjia Xu
Qingzhong Wang
Antoni B. Chan
38
16
0
08 Apr 2022
Modeling Motion with Multi-Modal Features for Text-Based Video
  Segmentation
Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation
Wangbo Zhao
Kai Wang
Xiangxiang Chu
Fuzhao Xue
Xinchao Wang
Yang You
29
21
0
06 Apr 2022
ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval
ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval
Mengjun Cheng
Yipeng Sun
Long Wang
Xiongwei Zhu
Kun Yao
...
Guoli Song
Junyu Han
Jingtuo Liu
Errui Ding
Jingdong Wang
24
60
0
31 Mar 2022
Parameter-efficient Model Adaptation for Vision Transformers
Parameter-efficient Model Adaptation for Vision Transformers
Xuehai He
Chunyuan Li
Pengchuan Zhang
Jianwei Yang
Qing Guo
30
84
0
29 Mar 2022
Sketching without Worrying: Noise-Tolerant Sketch-Based Image Retrieval
Sketching without Worrying: Noise-Tolerant Sketch-Based Image Retrieval
A. Bhunia
Subhadeep Koley
Abdullah Faiz Ur Rahman Khilji
Aneeshan Sain
Pinaki Nath Chowdhury
Tao Xiang
Yi-Zhe Song
AAML
22
42
0
28 Mar 2022
Two-stream Hierarchical Similarity Reasoning for Image-text Matching
Two-stream Hierarchical Similarity Reasoning for Image-text Matching
Ran Chen
Hanli Wang
Lei Wang
Sam Kwong
13
9
0
10 Mar 2022
Where Does the Performance Improvement Come From? -- A Reproducibility
  Concern about Image-Text Retrieval
Where Does the Performance Improvement Come From? -- A Reproducibility Concern about Image-Text Retrieval
Jun Rao
Fei-Yue Wang
Liang Ding
Shuhan Qi
Yibing Zhan
Weifeng Liu
Dacheng Tao
OOD
42
28
0
08 Mar 2022
A Survey of Vision-Language Pre-Trained Models
A Survey of Vision-Language Pre-Trained Models
Yifan Du
Zikang Liu
Junyi Li
Wayne Xin Zhao
VLM
33
179
0
18 Feb 2022
When Did It Happen? Duration-informed Temporal Localization of Narrated
  Actions in Vlogs
When Did It Happen? Duration-informed Temporal Localization of Narrated Actions in Vlogs
Oana Ignat
Santiago Castro
Yuhang Zhou
Jiajun Bao
Dandan Shan
Rada Mihalcea
18
3
0
16 Feb 2022
Do Lessons from Metric Learning Generalize to Image-Caption Retrieval?
Do Lessons from Metric Learning Generalize to Image-Caption Retrieval?
Maurits J. R. Bleeker
Maarten de Rijke
SSL
DML
29
9
0
14 Feb 2022
Multi-Modal Knowledge Graph Construction and Application: A Survey
Multi-Modal Knowledge Graph Construction and Application: A Survey
Xiangru Zhu
Zhixu Li
Xiaodan Wang
Xueyao Jiang
Penglei Sun
Xuwu Wang
Yanghua Xiao
N. Yuan
28
154
0
11 Feb 2022
Multimodal data matters: language model pre-training over structured and
  unstructured electronic health records
Multimodal data matters: language model pre-training over structured and unstructured electronic health records
Sicen Liu
Xiaolong Wang
Yongshuai Hou
Ge Li
Hui Wang
Huiqin Xu
Yang Xiang
Buzhou Tang
52
30
0
25 Jan 2022
Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular
  Vision-Language Pre-training
Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training
Yehao Li
Jiahao Fan
Yingwei Pan
Ting Yao
Weiyao Lin
Tao Mei
MLLM
ObjD
33
19
0
11 Jan 2022
VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts
VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts
Longtian Qiu
Renrui Zhang
Ziyu Guo
Wei Zhang
Zilu Guo
Ziyao Zeng
Guangnan Zhang
VLM
CLIP
28
45
0
04 Dec 2021
Object-aware Video-language Pre-training for Retrieval
Object-aware Video-language Pre-training for Retrieval
Alex Jinpeng Wang
Yixiao Ge
Guanyu Cai
Rui Yan
Xudong Lin
Ying Shan
Xiaohu Qie
Mike Zheng Shou
ViT
VLM
17
79
0
01 Dec 2021
EMScore: Evaluating Video Captioning via Coarse-Grained and Fine-Grained
  Embedding Matching
EMScore: Evaluating Video Captioning via Coarse-Grained and Fine-Grained Embedding Matching
Yaya Shi
Xu Yang
Haiyang Xu
Chunfen Yuan
Bing Li
Weiming Hu
Zhengjun Zha
39
33
0
17 Nov 2021
FILIP: Fine-grained Interactive Language-Image Pre-Training
FILIP: Fine-grained Interactive Language-Image Pre-Training
Lewei Yao
Runhu Huang
Lu Hou
Guansong Lu
Minzhe Niu
Hang Xu
Xiaodan Liang
Zhenguo Li
Xin Jiang
Chunjing Xu
VLM
CLIP
30
615
0
09 Nov 2021
Instance-Conditional Knowledge Distillation for Object Detection
Instance-Conditional Knowledge Distillation for Object Detection
Zijian Kang
Peizhen Zhang
Xinming Zhang
Jian Sun
N. Zheng
24
76
0
25 Oct 2021
Multimodal Dialogue Response Generation
Multimodal Dialogue Response Generation
Qingfeng Sun
Yujing Wang
Can Xu
Kai Zheng
Yaming Yang
Huang Hu
Fei Xu
Jessica Zhang
Xiubo Geng
Daxin Jiang
23
43
0
16 Oct 2021
Topic Scene Graph Generation by Attention Distillation from Caption
Topic Scene Graph Generation by Attention Distillation from Caption
Wenbin Wang
R. Wang
X. Chen
DiffM
19
14
0
12 Oct 2021
An animated picture says at least a thousand words: Selecting Gif-based
  Replies in Multimodal Dialog
An animated picture says at least a thousand words: Selecting Gif-based Replies in Multimodal Dialog
Xingyao Wang
David Jurgens
22
5
0
24 Sep 2021
Dense Contrastive Visual-Linguistic Pretraining
Dense Contrastive Visual-Linguistic Pretraining
Lei Shi
Kai Shuang
Shijie Geng
Peng Gao
Zuohui Fu
Gerard de Melo
Yunpeng Chen
Sen Su
VLM
SSL
54
10
0
24 Sep 2021
Survey: Transformer based Video-Language Pre-training
Survey: Transformer based Video-Language Pre-training
Ludan Ruan
Qin Jin
VLM
ViT
72
44
0
21 Sep 2021
Cross Modification Attention Based Deliberation Model for Image
  Captioning
Cross Modification Attention Based Deliberation Model for Image Captioning
Zheng Lian
Yanan Zhang
Haichang Li
Rui Wang
Xiaohui Hu
24
4
0
17 Sep 2021
DSSL: Deep Surroundings-person Separation Learning for Text-based Person
  Retrieval
DSSL: Deep Surroundings-person Separation Learning for Text-based Person Retrieval
A. Zhu
Zijie Wang
Yifeng Li
Xili Wan
Jing Jin
Tian Wang
Fangqiang Hu
G. Hua
95
162
0
12 Sep 2021
Are Gender-Neutral Queries Really Gender-Neutral? Mitigating Gender Bias
  in Image Search
Are Gender-Neutral Queries Really Gender-Neutral? Mitigating Gender Bias in Image Search
Jialu Wang
Yang Liu
Qing Guo
FaML
157
95
0
12 Sep 2021
TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment
TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment
Jianwei Yang
Yonatan Bisk
Jianfeng Gao
27
137
0
23 Aug 2021
ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and
  Intra-modal Knowledge Integration
ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration
Yuhao Cui
Zhou Yu
Chunqi Wang
Zhongzhou Zhao
Ji Zhang
Meng Wang
Jun-chen Yu
VLM
27
53
0
16 Aug 2021
ICAF: Iterative Contrastive Alignment Framework for Multimodal
  Abstractive Summarization
ICAF: Iterative Contrastive Alignment Framework for Multimodal Abstractive Summarization
Zijian Zhang
Chang Shu
Youxin Chen
Jing Xiao
Qian Zhang
Lu Zheng
23
5
0
11 Aug 2021
Previous
1234
Next