ResearchTrend.AI
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

25 July 2017
Peter Anderson
Xiaodong He
Chris Buehler
Damien Teney
Mark Johnson
Stephen Gould
Lei Zhang
    AIMat

Papers citing "Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering"

50 / 1,868 papers shown
Title
Multi-modal Feature Fusion with Feature Attention for VATEX Captioning Challenge 2020
Ke Lin
Zhuoxin Gan
Liwei Wang
23
8
0
05 Jun 2020
Pick-Object-Attack: Type-Specific Adversarial Attack for Object Detection
Omid Mohamad Nezami
Akshay Chaturvedi
Mark Dras
Utpal Garain
AAML ObjD
61
19
0
05 Jun 2020
Emergent Multi-Agent Communication in the Deep Learning Era
Angeliki Lazaridou
Marco Baroni
AI4CE
153
206
0
03 Jun 2020
Multimodal grid features and cell pointers for Scene Text Visual Question Answering
Lluís Gómez
Ali Furkan Biten
Rubèn Pérez Tito
Andrés Mafla
Marçal Rusiñol
Ernest Valveny
Dimosthenis Karatzas
58
21
0
01 Jun 2020
Structured Multimodal Attentions for TextVQA
Chenyu Gao
Qi Zhu
Peng Wang
Hui Li
Yuliang Liu
Anton Van Den Hengel
Qi Wu
99
60
0
01 Jun 2020
Controlling Length in Image Captioning
Ruotian Luo
G. Shakhnarovich
VLM
99
3
0
29 May 2020
TRIE: End-to-End Text Reading and Information Extraction for Document Understanding
Peng Zhang
Yunlu Xu
Zhanzhan Cheng
Shiliang Pu
Jing Lu
Liang Qiao
Yi Niu
Leilei Gan
SyDa
95
103
0
27 May 2020
FashionBERT: Text and Image Matching with Adaptive Loss for Cross-modal Retrieval
D. Gao
Linbo Jin
Ben Chen
Minghui Qiu
Peng Li
Yi Wei
Yitao Hu
Haozhe Jasper Wang
OOD
84
134
0
20 May 2020
Atss-Net: Target Speaker Separation via Attention-based Neural Network
Tingle Li
Qingjian Lin
Yuanyuan Bao
Ming Li
39
38
0
19 May 2020
Visual Relationship Detection using Scene Graphs: A Survey
Aniket Agarwal
Ayush Mangal
Vipul
GNN
70
21
0
16 May 2020
Adaptive Transformers for Learning Multimodal Representations
Prajjwal Bhargava
21
4
0
15 May 2020
A Novel Fusion of Attention and Sequence to Sequence Autoencoders to Predict Sleepiness From Speech
Shahin Amiriparian
Pawel Winokurow
Vincent Karas
Sandra Ottl
Maurice Gerczuk
Björn W. Schuller
53
6
0
15 May 2020
Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models
Jize Cao
Zhe Gan
Yu Cheng
Licheng Yu
Yen-Chun Chen
Jingjing Liu
VLM
123
130
0
15 May 2020
Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA
Hyounghun Kim
Zineng Tang
Joey Tianyi Zhou
80
31
0
13 May 2020
Cross-Modality Relevance for Reasoning on Language and Vision
Chen Zheng
Quan Guo
Parisa Kordjamshidi
LRM
88
36
0
12 May 2020
Non-Autoregressive Image Captioning with Counterfactuals-Critical Multi-Agent Learning
Longteng Guo
Jing Liu
Xinxin Zhu
Xingjian He
Jie Jiang
Hanqing Lu
BDL
73
58
0
10 May 2020
Character Matters: Video Story Understanding with Character-Aware Relations
Shijie Geng
Ji Zhang
Zuohui Fu
Peng Gao
Hang Zhang
Gerard de Melo
135
11
0
09 May 2020
History for Visual Dialog: Do we really need it?
Shubham Agarwal
Trung Bui
Joon-Young Lee
Ioannis Konstas
Verena Rieser
VLM
38
71
0
08 May 2020
Modeling Human Visual Search Performance on Realistic Webpages Using Analytical and Deep Learning Methods
Arianna Yuan
Yongqian Li
HAI
56
25
0
07 May 2020
Text Recognition in the Wild: A Survey
Xiaoxue Chen
Lianwen Jin
Yuanzhi Zhu
Canjie Luo
Tianwei Wang
3DV
128
105
0
07 May 2020
Unsupervised Multimodal Neural Machine Translation with Pseudo Visual Pivoting
Po-Yao (Bernie) Huang
Junjie Hu
Xiaojun Chang
Alexander G. Hauptmann
103
52
0
06 May 2020
Diagnosing the Environment Bias in Vision-and-Language Navigation
Yubo Zhang
Hao Tan
Joey Tianyi Zhou
73
57
0
06 May 2020
Visual Question Answering with Prior Class Semantics
Violetta Shevchenko
Damien Teney
A. Dick
Anton Van Den Hengel
BDL
55
7
0
04 May 2020
Probing Contextual Language Models for Common Ground with Visual Representations
Gabriel Ilharco
Rowan Zellers
Ali Farhadi
Hannaneh Hajishirzi
118
14
0
01 May 2020
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training
Linjie Li
Yen-Chun Chen
Yu Cheng
Zhe Gan
Licheng Yu
Jingjing Liu
MLLM VLM OffRL AI4TS
133
507
0
01 May 2020
Improving Vision-and-Language Navigation with Image-Text Pairs from the Web
Arjun Majumdar
Ayush Shrivastava
Stefan Lee
Peter Anderson
Devi Parikh
Dhruv Batra
LM&Ro
196
236
0
30 Apr 2020
Towards Embodied Scene Description
Sinan Tan
Huaping Liu
Di Guo
Xinyu Zhang
F. Sun
LM&Ro
52
9
0
30 Apr 2020
Dynamic Language Binding in Relational Visual Reasoning
T. Le
Vuong Le
Svetha Venkatesh
T. Tran
NAI
71
19
0
30 Apr 2020
Explainable Deep Learning: A Field Guide for the Uninitiated
Gabrielle Ras
Ning Xie
Marcel van Gerven
Derek Doran
AAML XAI
120
382
0
30 Apr 2020
Pragmatic Issue-Sensitive Image Captioning
Allen Nie
Reuben Cohn-Gordon
Christopher Potts
53
24
0
29 Apr 2020
Image Captioning through Image Transformer
Sen He
Wentong Liao
Hamed R. Tavakoli
M. Yang
Bodo Rosenhahn
N. Pugeault
ViT
95
94
0
29 Apr 2020
Cross-modal Speaker Verification and Recognition: A Multilingual Perspective
M. S. Saeed
Shah Nawaz
Pietro Morerio
Arif Mahmood
I. Gallo
Muhammad Haroon Yousaf
Alessio Del Bue
CVBM
84
27
0
28 Apr 2020
A Novel Attention-based Aggregation Function to Combine Vision and Language
Matteo Stefanini
Marcella Cornia
Lorenzo Baraldi
Rita Cucchiara
VLM
55
9
0
27 Apr 2020
Differentiable Adaptive Computation Time for Visual Reasoning
Cristobal Eyzaguirre
Á. Soto
68
18
0
27 Apr 2020
Deep Multimodal Neural Architecture Search
Zhou Yu
Yuhao Cui
Jun-chen Yu
Meng Wang
Dacheng Tao
Qi Tian
70
100
0
25 Apr 2020
MoVie: Revisiting Modulated Convolutions for Visual Counting and Beyond
Duy-Kien Nguyen
Vedanuj Goswami
Xinlei Chen
71
23
0
24 Apr 2020
Visual Question Answering Using Semantic Information from Image Descriptions
Tasmia Tasrin
Md Sultan al Nahian
Brent Harrison
28
0
0
23 Apr 2020
VisualCOMET: Reasoning about the Dynamic Context of a Still Image
J. S. Park
Chandra Bhagavatula
Roozbeh Mottaghi
Ali Farhadi
Yejin Choi
ReLM LRM
75
6
0
22 Apr 2020
ParaCNN: Visual Paragraph Generation via Adversarial Twin Contextual CNNs
Shiyang Yan
Yang Hua
N. Robertson
73
7
0
21 Apr 2020
Experience Grounds Language
Yonatan Bisk
Ari Holtzman
Jesse Thomason
Jacob Andreas
Yoshua Bengio
...
Angeliki Lazaridou
Jonathan May
Aleksandr Nisnevich
Nicolas Pinto
Joseph P. Turian
126
361
0
21 Apr 2020
Transformer Reasoning Network for Image-Text Matching and Retrieval
Nicola Messina
Fabrizio Falchi
Andrea Esuli
Giuseppe Amato
ViT
68
58
0
20 Apr 2020
Learning What Makes a Difference from Counterfactual Examples and Gradient Supervision
Damien Teney
Ehsan Abbasnejad
Anton Van Den Hengel
OOD SSL CML
93
119
0
20 Apr 2020
Graph-Structured Referring Expression Reasoning in The Wild
Sibei Yang
Guanbin Li
Yizhou Yu
NAI
74
95
0
19 Apr 2020
Are we pretraining it right? Digging deeper into visio-linguistic pretraining
Amanpreet Singh
Vedanuj Goswami
Devi Parikh
VLM
78
48
0
19 Apr 2020
Transform and Tell: Entity-Aware News Image Captioning
Alasdair Tran
A. Mathews
Lexing Xie
VLM
60
97
0
17 Apr 2020
Knowledge-Based Visual Question Answering in Videos
Noa Garcia
Mayu Otani
Chenhui Chu
Yuta Nakashima
18
0
0
17 Apr 2020
Reasoning Visual Dialog with Sparse Graph Learning and Knowledge Transfer
Gi-Cheon Kang
Junseok Park
Hwaran Lee
Byoung-Tak Zhang
Jin-Hwa Kim
VLM
62
10
0
14 Apr 2020
Relation Transformer Network
Rajat Koner
Poulami Sinhamahapatra
Volker Tresp
ViT
107
33
0
13 Apr 2020
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
Xiujun Li
Xi Yin
Chunyuan Li
Pengchuan Zhang
Xiaowei Hu
...
Houdong Hu
Li Dong
Furu Wei
Yejin Choi
Jianfeng Gao
VLM
209
1,954
0
13 Apr 2020
Visual Grounding Methods for VQA are Working for the Wrong Reasons!
Robik Shrestha
Kushal Kafle
Christopher Kanan
CML
66
35
0
12 Apr 2020