1707.07998
Cited By
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
25 July 2017
Peter Anderson
Xiaodong He
Chris Buehler
Damien Teney
Mark Johnson
Stephen Gould
Lei Zhang
AIMat
Papers citing
"Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering"
50 / 789 papers shown
RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning
Xiaojian Ma
Weili Nie
Zhiding Yu
Huaizu Jiang
Chaowei Xiao
Yuke Zhu
Song-Chun Zhu
Anima Anandkumar
ViT
LRM
30
19
0
24 Apr 2022
Training and challenging models for text-guided fashion image retrieval
Eric Dodds
Jack Culpepper
Gaurav Srivastava
23
8
0
23 Apr 2022
Exploring a Fine-Grained Multiscale Method for Cross-Modal Remote Sensing Image Retrieval
Zhiqiang Yuan
Wenkai Zhang
Kun Fu
Xuan Li
Chubo Deng
Hongqi Wang
Xian Sun
29
132
0
21 Apr 2022
Attention Mechanism based Cognition-level Scene Understanding
Xuejiao Tang
Tai Le Quy
LRM
35
0
0
17 Apr 2022
Guiding Attention using Partial-Order Relationships for Image Captioning
Murad Popattia
Muhammad Rafi
Rizwan Qureshi
Shah Nawaz
21
4
0
15 Apr 2022
COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval
Haoyu Lu
Nanyi Fei
Yuqi Huo
Yizhao Gao
Zhiwu Lu
Jiaxin Wen
CLIP
VLM
27
55
0
15 Apr 2022
A Call for Clarity in Beam Search: How It Works and When It Stops
Jungo Kasai
Keisuke Sakaguchi
Ronan Le Bras
Dragomir R. Radev
Yejin Choi
Noah A. Smith
28
6
0
11 Apr 2022
On Distinctive Image Captioning via Comparing and Reweighting
Jiuniu Wang
Wenjia Xu
Qingzhong Wang
Antoni B. Chan
48
16
0
08 Apr 2022
ECCV Caption: Correcting False Negatives by Collecting Machine-and-Human-verified Image-Caption Associations for MS-COCO
Sanghyuk Chun
Wonjae Kim
Song Park
Minsuk Chang
Seong Joon Oh
VLM
378
43
0
07 Apr 2022
OccamNets: Mitigating Dataset Bias by Favoring Simpler Hypotheses
Robik Shrestha
Kushal Kafle
Christopher Kanan
CML
38
13
0
05 Apr 2022
CLEVR-X: A Visual Reasoning Dataset for Natural Language Explanations
Leonard Salewski
A. Sophia Koepke
Hendrik P. A. Lensch
Zeynep Akata
LRM
NAI
33
20
0
05 Apr 2022
Attribute Prototype Network for Any-Shot Learning
Wenjia Xu
Yongqin Xian
Jiuniu Wang
Bernt Schiele
Zeynep Akata
VLM
37
37
0
04 Apr 2022
Question-Driven Graph Fusion Network For Visual Question Answering
Yuxi Qian
Yuncong Hu
Ruonan Wang
Fangxiang Feng
Xiaojie Wang
GNN
21
10
0
03 Apr 2022
Co-VQA : Answering by Interactive Sub Question Sequence
Ruonan Wang
Yuxi Qian
Fangxiang Feng
Xiaojie Wang
Huixing Jiang
LRM
29
16
0
02 Apr 2022
Shifting More Attention to Visual Backbone: Query-modulated Refinement Networks for End-to-End Visual Grounding
Jiabo Ye
Junfeng Tian
Ming Yan
Xiaoshan Yang
Xuwu Wang
Ji Zhang
Liang He
Xin Lin
ObjD
19
61
0
29 Mar 2022
Quantifying Societal Bias Amplification in Image Captioning
Yusuke Hirota
Yuta Nakashima
Noa Garcia
24
48
0
29 Mar 2022
End-to-End Transformer Based Model for Image Captioning
Yiyu Wang
Jungang Xu
Yingfei Sun
VLM
ViT
26
117
0
29 Mar 2022
A General Survey on Attention Mechanisms in Deep Learning
Gianni Brauwers
Flavius Frasincar
31
298
0
27 Mar 2022
MuKEA: Multimodal Knowledge Extraction and Accumulation for Knowledge-based Visual Question Answering
Yang Ding
Jing Yu
Bangchang Liu
Yue Hu
Mingxin Cui
Qi Wu
13
62
0
17 Mar 2022
UNIMO-2: End-to-End Unified Vision-Language Grounded Learning
Wei Li
Can Gao
Guocheng Niu
Xinyan Xiao
Hao Liu
Jiachen Liu
Hua Wu
Haifeng Wang
MLLM
19
21
0
17 Mar 2022
Pseudo-Q: Generating Pseudo Language Queries for Visual Grounding
Haojun Jiang
Yuanze Lin
Dongchen Han
Shiji Song
Gao Huang
ObjD
48
51
0
16 Mar 2022
Spot the Difference: A Cooperative Object-Referring Game in Non-Perfectly Co-Observable Scene
Duo Zheng
Fandong Meng
Q. Si
Hairun Fan
Zipeng Xu
Jie Zhou
Fangxiang Feng
Xiaojie Wang
27
0
0
16 Mar 2022
Extracting associations and meanings of objects depicted in artworks through bi-modal deep networks
Gregory Kell
Ryan-Rhys Griffiths
Anthony Bourached
D. Stork
27
3
0
14 Mar 2022
Global2Local: A Joint-Hierarchical Attention for Video Captioning
Chengpeng Dai
Fuhai Chen
Xiaoshuai Sun
Rongrong Ji
QiXiang Ye
Yongjian Wu
22
1
0
13 Mar 2022
Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation
Wenliang Dai
Lu Hou
Lifeng Shang
Xin Jiang
Qun Liu
Pascale Fung
VLM
22
90
0
12 Mar 2022
REX: Reasoning-aware and Grounded Explanation
Shi Chen
Qi Zhao
25
18
0
11 Mar 2022
Two-stream Hierarchical Similarity Reasoning for Image-text Matching
Ran Chen
Hanli Wang
Lei Wang
Sam Kwong
21
9
0
10 Mar 2022
Knowledge-enriched Attention Network with Group-wise Semantic for Visual Storytelling
Tengpeng Li
Hanli Wang
Bin He
Changan Chen
DiffM
27
9
0
10 Mar 2022
NLX-GPT: A Model for Natural Language Explanations in Vision and Vision-Language Tasks
Fawaz Sammani
Tanmoy Mukherjee
Nikos Deligiannis
MILM
ELM
LRM
24
67
0
09 Mar 2022
AssistQ: Affordance-centric Question-driven Task Completion for Egocentric Assistant
B. Wong
Joya Chen
You Wu
Stan Weixian Lei
Dongxing Mao
Difei Gao
Mike Zheng Shou
EgoV
35
27
0
08 Mar 2022
Where Does the Performance Improvement Come From? -- A Reproducibility Concern about Image-Text Retrieval
Jun Rao
Fei Wang
Liang Ding
Shuhan Qi
Yibing Zhan
Weifeng Liu
Dacheng Tao
OOD
42
28
0
08 Mar 2022
GlideNet: Global, Local and Intrinsic based Dense Embedding NETwork for Multi-category Attributes Prediction
Kareem M. Metwaly
Aerin Kim
E. Branson
V. Monga
43
7
0
07 Mar 2022
Modeling Coreference Relations in Visual Dialog
Mingxiao Li
Marie-Francine Moens
19
9
0
06 Mar 2022
Dynamic Key-value Memory Enhanced Multi-step Graph Reasoning for Knowledge-based Visual Question Answering
Mingxiao Li
Marie-Francine Moens
17
12
0
06 Mar 2022
Important Object Identification with Semi-Supervised Learning for Autonomous Driving
Jiachen Li
Haiming Gang
Hengbo Ma
Masayoshi Tomizuka
Chiho Choi
32
12
0
05 Mar 2022
Video Question Answering: Datasets, Algorithms and Challenges
Yaoyao Zhong
Junbin Xiao
Wei Ji
Yicong Li
Wei Deng
Tat-Seng Chua
32
87
0
02 Mar 2022
Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment
Mingyang Zhou
Licheng Yu
Amanpreet Singh
Mengjiao MJ Wang
Zhou Yu
Ning Zhang
VLM
33
31
0
01 Mar 2022
Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation
Shizhe Chen
Pierre-Louis Guhur
Makarand Tapaswi
Cordelia Schmid
Ivan Laptev
LM&Ro
36
139
0
23 Feb 2022
CaMEL: Mean Teacher Learning for Image Captioning
Manuele Barraco
Matteo Stefanini
Marcella Cornia
S. Cascianelli
Lorenzo Baraldi
Rita Cucchiara
ViT
VLM
43
27
0
21 Feb 2022
VLP: A Survey on Vision-Language Pre-training
Feilong Chen
Duzhen Zhang
Minglun Han
Xiuyi Chen
Jing Shi
Shuang Xu
Bo Xu
VLM
82
213
0
18 Feb 2022
When Did It Happen? Duration-informed Temporal Localization of Narrated Actions in Vlogs
Oana Ignat
Santiago Castro
Yuhang Zhou
Jiajun Bao
Dandan Shan
Rada Mihalcea
18
3
0
16 Feb 2022
Delving Deeper into Cross-lingual Visual Question Answering
Chen Cecilia Liu
Jonas Pfeiffer
Anna Korhonen
Ivan Vulić
Iryna Gurevych
37
8
0
15 Feb 2022
ViNTER: Image Narrative Generation with Emotion-Arc-Aware Transformer
Kohei Uehara
Yusuke Mori
Yusuke Mukuta
Tatsuya Harada
30
6
0
15 Feb 2022
Multi-Modal Knowledge Graph Construction and Application: A Survey
Xiangru Zhu
Zhixu Li
Xiaodan Wang
Xueyao Jiang
Penglei Sun
Xuwu Wang
Yanghua Xiao
N. Yuan
41
154
0
11 Feb 2022
ACORT: A Compact Object Relation Transformer for Parameter Efficient Image Captioning
J. Tan
Y. Tan
C. Chan
Joon Huang Chuah
VLM
ViT
31
15
0
11 Feb 2022
Lightweight Jet Reconstruction and Identification as an Object Detection Task
Adrian Alan Pol
T. Aarrestad
E. Govorkova
Roi Halily
Anat Klempner
...
Vladimir Loncar
J. Ngadiuba
M. Pierini
Olya Sirkin
S. Summers
32
2
0
09 Feb 2022
Self-Supervised Representation Learning for Speech Using Visual Grounding and Masked Language Modeling
Puyuan Peng
David Harwath
SSL
43
26
0
07 Feb 2022
Webly Supervised Concept Expansion for General Purpose Vision Models
Amita Kamath
Christopher Clark
Tanmay Gupta
Eric Kolve
Derek Hoiem
Aniruddha Kembhavi
VLM
35
54
0
04 Feb 2022
Deep Learning Approaches on Image Captioning: A Review
Taraneh Ghandi
H. Pourreza
H. Mahyar
VLM
30
89
0
31 Jan 2022
IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages
Emanuele Bugliarello
Fangyu Liu
Jonas Pfeiffer
Siva Reddy
Desmond Elliott
Edoardo Ponti
Ivan Vulić
MLLM
VLM
ELM
50
62
0
27 Jan 2022