Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
1707.07998
Cited By
v1
v2
v3 (latest)
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
25 July 2017
Peter Anderson
Xiaodong He
Chris Buehler
Damien Teney
Mark Johnson
Stephen Gould
Lei Zhang
AIMat
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering"
50 / 1,868 papers shown
Title
Show, Interpret and Tell: Entity-aware Contextualised Image Captioning in Wikipedia
K. Nguyen
Ali Furkan Biten
Andrés Mafla
Lluís Gómez
Dimosthenis Karatzas
56
11
0
21 Sep 2022
Toward 3D Spatial Reasoning for Human-like Text-based Visual Question Answering
Hao Li
Jinfa Huang
Peng Jin
Guoli Song
Qi Wu
Jie Chen
141
22
0
21 Sep 2022
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
Pan Lu
Swaroop Mishra
Tony Xia
Liang Qiu
Kai-Wei Chang
Song-Chun Zhu
Oyvind Tafjord
Peter Clark
Ashwin Kalyan
ELM
ReLM
LRM
295
1,301
0
20 Sep 2022
How to Adapt Pre-trained Vision-and-Language Models to a Text-only Input?
Lovisa Hagström
Richard Johansson
VLM
64
4
0
19 Sep 2022
ERNIE-mmLayout: Multi-grained MultiModal Transformer for Document Understanding
Wenjin Wang
Zhengjie Huang
Bin Luo
Qianglong Chen
Qiming Peng
...
Weichong Yin
Shi Feng
Yu Sun
Dianhai Yu
Yin Zhang
ViT
76
13
0
18 Sep 2022
Overcoming Language Priors in Visual Question Answering via Distinguishing Superficially Similar Instances
Yike Wu
Yu Zhao
Shiwan Zhao
Ying Zhang
Xiaojie Yuan
Guoqing Zhao
Ning Jiang
117
19
0
18 Sep 2022
Learning Distinct and Representative Styles for Image Captioning
Qi Chen
Chaorui Deng
Qi Wu
VLM
75
24
0
17 Sep 2022
Belief Revision based Caption Re-ranker with Visual Semantic Information
Ahmed Sabir
Francesc Moreno-Noguer
Pranava Madhyastha
Lluís Padró
BDL
64
2
0
16 Sep 2022
Distribution Aware Metrics for Conditional Natural Language Generation
David M. Chan
Yiming Ni
David A. Ross
Sudheendra Vijayanarasimhan
Austin Myers
John F. Canny
77
4
0
15 Sep 2022
M^4I: Multi-modal Models Membership Inference
Pingyi Hu
Zihan Wang
Ruoxi Sun
Hu Wang
Minhui Xue
97
27
0
15 Sep 2022
Correlation Information Bottleneck: Towards Adapting Pretrained Multimodal Models for Robust Visual Question Answering
Jingjing Jiang
Zi-yi Liu
Nanning Zheng
89
8
0
14 Sep 2022
MUST-VQA: MUltilingual Scene-text VQA
Emanuele Vivoli
Ali Furkan Biten
Andrés Mafla
Dimosthenis Karatzas
Lluís Gómez
113
6
0
14 Sep 2022
PreSTU: Pre-Training for Scene-Text Understanding
Jihyung Kil
Soravit Changpinyo
Xi Chen
Hexiang Hu
Sebastian Goodman
Wei-Lun Chao
Radu Soricut
VLM
191
29
0
12 Sep 2022
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling
Tsu-Jui Fu
Linjie Li
Zhe Gan
Kevin Qinghong Lin
William Yang Wang
Lijuan Wang
Zicheng Liu
VLM
130
65
0
04 Sep 2022
vieCap4H-VLSP 2021: Vietnamese Image Captioning for Healthcare Domain using Swin Transformer and Attention-based LSTM
THANH VAN NGUYEN
Long H. Nguyen
Nhat Truong Pham
Liu Tai Nguyen
Van Huong Do
Hai Nguyen
Ngoc Duy Nguyen
VLM
ViT
43
1
0
03 Sep 2022
Disentangle and Remerge: Interventional Knowledge Distillation for Few-Shot Object Detection from A Conditional Causal Perspective
Jiangmeng Li
Yanan Zhang
Jingyao Wang
Hui Xiong
Chengbo Jiao
Xiaohui Hu
Changwen Zheng
Gang Hua
CML
110
30
0
26 Aug 2022
AiM: Taking Answers in Mind to Correct Chinese Cloze Tests in Educational Applications
Yusen Zhang
Zhongli Li
Qingyu Zhou
Ziyi Liu
Chao Li
Mina W. Ma
Yunbo Cao
Hongzhi Liu
93
1
0
26 Aug 2022
Symbolic Replay: Scene Graph as Prompt for Continual Learning on VQA Task
Stan Weixian Lei
Difei Gao
Jay Zhangjie Wu
Yuxuan Wang
Wei Liu
Meng Zhang
Mike Zheng Shou
71
38
0
24 Aug 2022
Bidirectional Contrastive Split Learning for Visual Question Answering
Yuwei Sun
H. Ochiai
37
2
0
24 Aug 2022
FashionVQA: A Domain-Specific Visual Question Answering System
Min Wang
A. Mahjoubfar
Anupama Joshi
101
4
0
24 Aug 2022
Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks
Wenhui Wang
Hangbo Bao
Li Dong
Johan Bjorck
Zhiliang Peng
...
Kriti Aggarwal
O. Mohammed
Saksham Singhal
Subhojit Som
Furu Wei
MLLM
VLM
ViT
157
645
0
22 Aug 2022
A Medical Semantic-Assisted Transformer for Radiographic Report Generation
Zhanyu Wang
Mingkang Tang
Lei Wang
Xiu Li
Luping Zhou
ViT
MedIm
81
58
0
22 Aug 2022
GSRFormer: Grounded Situation Recognition Transformer with Alternate Semantic Attention Refinement
Zhi-Qi Cheng
Qianwen Dai
Siyao Li
Teruko Mitamura
Alexander G. Hauptmann
70
37
0
18 Aug 2022
Multimodal foundation models are better simulators of the human brain
Haoyu Lu
Qiongyi Zhou
Nanyi Fei
Zhiwu Lu
Mingyu Ding
...
Changde Du
Xin Zhao
Haoran Sun
Huiguang He
J. Wen
AI4CE
85
13
0
17 Aug 2022
Understanding Attention for Vision-and-Language Tasks
Feiqi Cao
S. Han
Siqu Long
Changwei Xu
Josiah Poon
77
5
0
17 Aug 2022
Exploiting Multiple Sequence Lengths in Fast End to End Training for Image Captioning
J. Hu
Roberto Cavicchioli
Alessandro Capotondi
128
22
0
13 Aug 2022
Aesthetic Attributes Assessment of Images with AMANv2 and DPC-CaptionsV2
Xinghui Zhou
Xin Jin
Jianwen Lv
Heng Huang
Ming Mao
Shuai Cui
CoGe
46
0
0
09 Aug 2022
Distinctive Image Captioning via CLIP Guided Group Optimization
Youyuan Zhang
Jiuniu Wang
Hao Wu
Wenjia Xu
VLM
95
8
0
08 Aug 2022
ChiQA: A Large Scale Image-based Real-World Question Answering Dataset for Multi-Modal Understanding
Bingning Wang
Feiya Lv
Ting Yao
Yiming Yuan
Jin Ma
Yu Luo
Haijin Liang
68
3
0
05 Aug 2022
Fine-Grained Semantically Aligned Vision-Language Pre-Training
Juncheng Li
Xin He
Longhui Wei
Long Qian
Linchao Zhu
Lingxi Xie
Yueting Zhuang
Qi Tian
Siliang Tang
VLM
106
80
0
04 Aug 2022
TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation
Jun Wang
M. Gao
Yuqian Hu
Ramprasaath R. Selvaraju
Chetan Ramaiah
Ran Xu
Joseph Jaja
Larry S. Davis
ViT
72
18
0
03 Aug 2022
Generative Bias for Robust Visual Question Answering
Jae-Won Cho
Dong-Jin Kim
H. Ryu
In So Kweon
OOD
CML
100
20
0
01 Aug 2022
ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval
Nicola Messina
Matteo Stefanini
Marcella Cornia
Lorenzo Baraldi
Fabrizio Falchi
Giuseppe Amato
Rita Cucchiara
VLM
40
22
0
29 Jul 2022
Pro-tuning: Unified Prompt Tuning for Vision Tasks
Xing Nie
Bolin Ni
Jianlong Chang
Gaomeng Meng
Chunlei Huo
Zhaoxiang Zhang
Shiming Xiang
Qi Tian
Chunhong Pan
AAML
VPVLM
VLM
122
76
0
28 Jul 2022
Uncertainty-based Visual Question Answering: Estimating Semantic Inconsistency between Image and Knowledge Base
Jinyeong Chae
Jihie Kim
52
2
0
27 Jul 2022
Retrieval-Augmented Transformer for Image Captioning
Sara Sarto
Marcella Cornia
Lorenzo Baraldi
Rita Cucchiara
88
59
0
26 Jul 2022
LaKo: Knowledge-driven Visual Question Answering via Late Knowledge-to-Text Injection
Zhuo Chen
Yufen Huang
Jiaoyan Chen
Yuxia Geng
Yin Fang
Jeff Z. Pan
Ningyu Zhang
Wen Zhang
95
38
0
26 Jul 2022
Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering
Yang Liu
Guanbin Li
Liang Lin
LRM
172
87
0
26 Jul 2022
Is GPT-3 all you need for Visual Question Answering in Cultural Heritage?
P. Bongini
Federico Becattini
A. Bimbo
44
13
0
25 Jul 2022
Visual Perturbation-aware Collaborative Learning for Overcoming the Language Prior Problem
Yudong Han
Liqiang Nie
Jianhua Yin
Jianlong Wu
Yan Yan
86
14
0
24 Jul 2022
Chunk-aware Alignment and Lexical Constraint for Visual Entailment with Natural Language Explanations
Qian Yang
Yunxin Li
Baotian Hu
Lin Ma
Yuxin Ding
Min Zhang
91
10
0
23 Jul 2022
Rethinking the Reference-based Distinctive Image Captioning
Yangjun Mao
Long Chen
Zhihong Jiang
Dong Zhang
Zhimeng Zhang
Jian Shao
Jun Xiao
DiffM
83
22
0
22 Jul 2022
Efficient Modeling of Future Context for Image Captioning
Zhengcong Fei
Junshi Huang
Xiaoming Wei
Xiaolin K. Wei
76
15
0
22 Jul 2022
Semantic-aware Modular Capsule Routing for Visual Question Answering
Yudong Han
Jianhua Yin
Jianlong Wu
Yin-wei Wei
Liqiang Nie
62
8
0
21 Jul 2022
GRIT: Faster and Better Image captioning Transformer Using Dual Visual Features
Van-Quang Nguyen
Masanori Suganuma
Takayuki Okatani
ViT
84
114
0
20 Jul 2022
Explicit Image Caption Editing
Zhen Wang
Long Chen
Wenbo Ma
G. Han
Yulei Niu
Jian Shao
Jun Xiao
60
12
0
20 Jul 2022
Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification
Renrui Zhang
Zhang Wei
Rongyao Fang
Peng Gao
Kunchang Li
Jifeng Dai
Yu Qiao
Hongsheng Li
VLM
133
321
0
19 Jul 2022
Geometric Features Informed Multi-person Human-object Interaction Recognition in Videos
Tanqiu Qiao
Qianhui Men
Frederick W. B. Li
Yoshiki Kubotani
Shigeo Morishima
Hubert P. H. Shum
73
19
0
19 Jul 2022
Rethinking Data Augmentation for Robust Visual Question Answering
Long Chen
Yuhang Zheng
Jun Xiao
OOD
88
43
0
18 Jul 2022
Knowledge Guided Bidirectional Attention Network for Human-Object Interaction Detection
Jingjia Huang
Baixiang Yang
110
0
0
16 Jul 2022
Previous
1
2
3
...
12
13
14
...
36
37
38
Next