Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
1707.07998
Cited By
v1
v2
v3 (latest)
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
25 July 2017
Peter Anderson
Xiaodong He
Chris Buehler
Damien Teney
Mark Johnson
Stephen Gould
Lei Zhang
AIMat
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering"
50 / 1,868 papers shown
Title
Visually Grounded VQA by Lattice-based Retrieval
Daniel Reich
F. Putze
Tanja Schultz
45
2
0
15 Nov 2022
Zero-shot Image Captioning by Anchor-augmented Vision-Language Space Alignment
Junyan Wang
Yi Zhang
Ming Yan
Ji Zhang
Jitao Sang
VLM
55
9
0
14 Nov 2022
VieCap4H-VLSP 2021: ObjectAoA-Enhancing performance of Object Relation Transformer with Attention on Attention for Vietnamese image captioning
Nghia Hieu Nguyen
Duong T.D. Vo
Minh-Quan Ha
ViT
48
1
0
10 Nov 2022
Towards Reasoning-Aware Explainable VQA
Rakesh Vaideeswaran
Feng Gao
Abhinav Mathur
Govind Thattai
LRM
83
3
0
09 Nov 2022
Portmanteauing Features for Scene Text Recognition
Yew Lee Tan
Ernest Yu Kai Chew
A. Kong
Jung-jae Kim
J. Lim
70
0
0
09 Nov 2022
ERNIE-UniX2: A Unified Cross-lingual Cross-modal Framework for Understanding and Generation
Bin Shan
Yaqian Han
Weichong Yin
Shuohuan Wang
Yu Sun
Hao Tian
Hua Wu
Haifeng Wang
MLLM
VLM
88
8
0
09 Nov 2022
Adaptive Contrastive Learning on Multimodal Transformer for Review Helpfulness Predictions
Thong Nguyen
Xiaobao Wu
Anh Tuan Luu
Cong-Duy Nguyen
Zhen Hai
Lidong Bing
82
13
0
07 Nov 2022
CLOP: Video-and-Language Pre-Training with Knowledge Regularizations
Guohao Li
Hu Yang
Feng He
Zhifan Feng
Yajuan Lyu
Hua Wu
Haifeng Wang
VLM
45
1
0
07 Nov 2022
OSIC: A New One-Stage Image Captioner Coined
Bo Wang
Zhao Zhang
Ming Zhao
Xiaojie Jin
Mingliang Xu
Meng Wang
VLM
74
4
0
04 Nov 2022
CAMANet: Class Activation Map Guided Attention Network for Radiology Report Generation
Jun Wang
A. Bhalerao
Terry Yin
Simon See
Yulan He
MedIm
78
18
0
02 Nov 2022
Text-Only Training for Image Captioning using Noise-Injected CLIP
David Nukrai
Ron Mokady
Amir Globerson
VLM
CLIP
138
98
0
01 Nov 2022
Training Vision-Language Models with Less Bimodal Supervision
Elad Segal
Ben Bogin
Jonathan Berant
VLM
48
2
0
01 Nov 2022
What's Different between Visual Question Answering for Machine "Understanding" Versus for Accessibility?
Yang Trista Cao
Kyle Seelman
Kyungjun Lee
Hal Daumé
41
5
0
26 Oct 2022
Visual Semantic Parsing: From Images to Abstract Meaning Representation
M. A. Abdelsalam
Zhan Shi
Federico Fancellu
Kalliopi Basioti
Dhaivat Bhatt
Vladimir Pavlovic
Afsaneh Fazly
GNN
85
4
0
26 Oct 2022
Compressing And Debiasing Vision-Language Pre-Trained Models for Visual Question Answering
Q. Si
Yuanxin Liu
Zheng Lin
Peng Fu
Weiping Wang
VLM
117
1
0
26 Oct 2022
Multilingual Multimodal Learning with Machine Translated Text
Chen Qiu
Dan Oneaţă
Emanuele Bugliarello
Stella Frank
Desmond Elliott
121
15
0
24 Oct 2022
Dissecting Deep Metric Learning Losses for Image-Text Retrieval
Hong Xuan
Xi Chen
66
2
0
21 Oct 2022
Image-Text Retrieval with Binary and Continuous Label Supervision
Zheng Li
Caili Guo
Zerun Feng
Lei Li
Ying Jin
Yufeng Zhang
VLM
71
4
0
20 Oct 2022
A Survey of Computer Vision Technologies In Urban and Controlled-environment Agriculture
Jiayun Luo
Boyang Albert Li
Cyril Leung
144
15
0
20 Oct 2022
Visual Spatial Description: Controlled Spatial-Oriented Image-to-Text Generation
Yu Zhao
Jianguo Wei
Zhichao Lin
Yueheng Sun
Meishan Zhang
Hao Fei
68
16
0
20 Oct 2022
Grounded Video Situation Recognition
Zeeshan Khan
C. V. Jawahar
Makarand Tapaswi
92
14
0
19 Oct 2022
CPL: Counterfactual Prompt Learning for Vision and Language Models
Xuehai He
Diji Yang
Weixi Feng
Tsu-Jui Fu
Arjun Reddy Akula
Varun Jampani
P. Narayana
Sugato Basu
William Yang Wang
Xinze Wang
VPVLM
VLM
94
15
0
19 Oct 2022
Probing Cross-modal Semantics Alignment Capability from the Textual Perspective
Zheng Ma
Shi Zong
Mianzhi Pan
Jianbing Zhang
Shujian Huang
Xinyu Dai
Jiajun Chen
54
4
0
18 Oct 2022
CNT (Conditioning on Noisy Targets): A new Algorithm for Leveraging Top-Down Feedback
Alexia Jolicoeur-Martineau
Alex Lamb
Vikas Verma
Aniket Didolkar
NoLa
30
0
0
18 Oct 2022
Cross-modal Semantic Enhanced Interaction for Image-Sentence Retrieval
Xuri Ge
Fuhai Chen
Songpei Xu
Fuxiang Tao
J. Jose
57
26
0
17 Oct 2022
Novel 3D Scene Understanding Applications From Recurrence in a Single Image
Shimian Zhang
Skanda Bharadwaj
Keaton Kraiger
Yashasvi Asthana
Hong Zhang
R. Collins
Yanxi Liu
121
1
0
14 Oct 2022
Hybrid Reinforced Medical Report Generation with M-Linear Attention and Repetition Penalty
Wenting Xu
Zhenghua Xu
Junyang Chen
Chang Qi
Thomas Lukasiewicz
MedIm
62
8
0
14 Oct 2022
EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning
Tiannan Wang
Wangchunshu Zhou
Yan Zeng
Xinsong Zhang
VLM
82
44
0
14 Oct 2022
Plausible May Not Be Faithful: Probing Object Hallucination in Vision-Language Pre-training
Wenliang Dai
Zihan Liu
Ziwei Ji
Dan Su
Pascale Fung
MLLM
VLM
86
67
0
14 Oct 2022
One does not fit all! On the Complementarity of Vision Encoders for Vision and Language Tasks
Gregor Geigle
Chen Cecilia Liu
Jonas Pfeiffer
Iryna Gurevych
VLM
50
1
0
12 Oct 2022
Learning by Asking Questions for Knowledge-based Novel Object Recognition
Kohei Uehara
Tatsuya Harada
24
1
0
12 Oct 2022
Generating image captions with external encyclopedic knowledge
S. Nikiforova
Tejaswini Deoskar
Denis Paperno
Yoad Winter
72
2
0
10 Oct 2022
Improving Visual-Semantic Embeddings by Learning Semantically-Enhanced Hard Negatives for Cross-modal Information Retrieval
Yan Gong
Georgina Cosma
71
11
0
10 Oct 2022
Towards Robust Visual Question Answering: Making the Most of Biased Samples via Contrastive Learning
Q. Si
Yuanxin Liu
Fandong Meng
Zheng Lin
Peng Fu
Yanan Cao
Weiping Wang
Jie Zhou
88
24
0
10 Oct 2022
Learning Fine-Grained Visual Understanding for Video Question Answering via Decoupling Spatial-Temporal Modeling
Hsin-Ying Lee
Hung-Ting Su
Bing-Chen Tsai
Tsung-Han Wu
Jia-Fong Yeh
Winston H. Hsu
86
2
0
08 Oct 2022
Contextual Modeling for 3D Dense Captioning on Point Clouds
Yufeng Zhong
Longdao Xu
Jiebo Luo
Lin Ma
85
15
0
08 Oct 2022
Video Referring Expression Comprehension via Transformer with Content-aware Query
Ji Jiang
Meng Cao
Tengtao Song
Yuexian Zou
83
5
0
06 Oct 2022
VLSNR:Vision-Linguistics Coordination Time Sequence-aware News Recommendation
Songhao Han
Wei-Ping Huang
Xiaotian Luan Beihang University
AI4TS
74
3
0
06 Oct 2022
AOE-Net: Entities Interactions Modeling with Adaptive Attention Mechanism for Temporal Action Proposals Generation
Khoa T. Vo
Sang Truong
Kashu Yamazaki
Bhiksha Raj
Minh-Triet Tran
Ngan Le
156
30
0
05 Oct 2022
Vision+X: A Survey on Multimodal Learning in the Light of Data
Ye Zhu
Yuehua Wu
N. Sebe
Yan Yan
105
19
0
05 Oct 2022
Learning to Collocate Visual-Linguistic Neural Modules for Image Captioning
Xu Yang
Hanwang Zhang
Chongyang Gao
Jianfei Cai
MLLM
81
10
0
04 Oct 2022
Extending Compositional Attention Networks for Social Reasoning in Videos
Christina Sartzetaki
Georgios Paraskevopoulos
Alexandros Potamianos
LRM
43
3
0
03 Oct 2022
A Dual-Attention Learning Network with Word and Sentence Embedding for Medical Visual Question Answering
Xiaofei Huang
Hongfang Gong
MedIm
106
14
0
01 Oct 2022
Task Formulation Matters When Learning Continually: A Case Study in Visual Question Answering
Mavina Nikandrou
Lu Yu
Alessandro Suglia
Ioannis Konstas
Verena Rieser
OOD
76
5
0
30 Sep 2022
SmallCap: Lightweight Image Captioning Prompted with Retrieval Augmentation
R. Ramos
Bruno Martins
Desmond Elliott
Yova Kementchedjhieva
VLM
89
89
0
30 Sep 2022
ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training
Bin Shan
Weichong Yin
Yu Sun
Hao Tian
Hua Wu
Haifeng Wang
VLM
75
19
0
30 Sep 2022
Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding
Fengyuan Shi
Ruopeng Gao
Weilin Huang
Limin Wang
105
28
0
28 Sep 2022
Unified Loss of Pair Similarity Optimization for Vision-Language Retrieval
Zheng Li
Caili Guo
Xin Eric Wang
Zerun Feng
Lei Li
Zhongtian Du
VLM
70
2
0
28 Sep 2022
A Survey on Graph Neural Networks and Graph Transformers in Computer Vision: A Task-Oriented Perspective
Chaoqi Chen
Yushuang Wu
Qiyuan Dai
Hong-Yu Zhou
Mutian Xu
Sibei Yang
Xiaoguang Han
Yizhou Yu
ViT
MedIm
AI4CE
137
80
0
27 Sep 2022
Word to Sentence Visual Semantic Similarity for Caption Generation: Lessons Learned
Ahmed Sabir
119
0
0
26 Sep 2022
Previous
1
2
3
...
11
12
13
...
36
37
38
Next