arXiv:1707.07998 (v3, latest)

Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

25 July 2017
Peter Anderson
Xiaodong He
Chris Buehler
Damien Teney
Mark Johnson
Stephen Gould
Lei Zhang
    AIMat

Papers citing "Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering"

50 / 1,868 papers shown
Title
Visually Grounded VQA by Lattice-based Retrieval
Daniel Reich
F. Putze
Tanja Schultz
45
2
0
15 Nov 2022
Zero-shot Image Captioning by Anchor-augmented Vision-Language Space Alignment
Junyan Wang
Yi Zhang
Ming Yan
Ji Zhang
Jitao Sang
VLM
55
9
0
14 Nov 2022
VieCap4H-VLSP 2021: ObjectAoA-Enhancing performance of Object Relation Transformer with Attention on Attention for Vietnamese image captioning
Nghia Hieu Nguyen
Duong T.D. Vo
Minh-Quan Ha
ViT
48
1
0
10 Nov 2022
Towards Reasoning-Aware Explainable VQA
Rakesh Vaideeswaran
Feng Gao
Abhinav Mathur
Govind Thattai
LRM
83
3
0
09 Nov 2022
Portmanteauing Features for Scene Text Recognition
Yew Lee Tan
Ernest Yu Kai Chew
A. Kong
Jung-jae Kim
J. Lim
70
0
0
09 Nov 2022
ERNIE-UniX2: A Unified Cross-lingual Cross-modal Framework for Understanding and Generation
Bin Shan
Yaqian Han
Weichong Yin
Shuohuan Wang
Yu Sun
Hao Tian
Hua Wu
Haifeng Wang
MLLM, VLM
88
8
0
09 Nov 2022
Adaptive Contrastive Learning on Multimodal Transformer for Review Helpfulness Predictions
Thong Nguyen
Xiaobao Wu
Anh Tuan Luu
Cong-Duy Nguyen
Zhen Hai
Lidong Bing
82
13
0
07 Nov 2022
CLOP: Video-and-Language Pre-Training with Knowledge Regularizations
Guohao Li
Hu Yang
Feng He
Zhifan Feng
Yajuan Lyu
Hua Wu
Haifeng Wang
VLM
45
1
0
07 Nov 2022
OSIC: A New One-Stage Image Captioner Coined
Bo Wang
Zhao Zhang
Ming Zhao
Xiaojie Jin
Mingliang Xu
Meng Wang
VLM
74
4
0
04 Nov 2022
CAMANet: Class Activation Map Guided Attention Network for Radiology Report Generation
Jun Wang
A. Bhalerao
Terry Yin
Simon See
Yulan He
MedIm
78
18
0
02 Nov 2022
Text-Only Training for Image Captioning using Noise-Injected CLIP
David Nukrai
Ron Mokady
Amir Globerson
VLM, CLIP
138
98
0
01 Nov 2022
Training Vision-Language Models with Less Bimodal Supervision
Elad Segal
Ben Bogin
Jonathan Berant
VLM
48
2
0
01 Nov 2022
What's Different between Visual Question Answering for Machine "Understanding" Versus for Accessibility?
Yang Trista Cao
Kyle Seelman
Kyungjun Lee
Hal Daumé
41
5
0
26 Oct 2022
Visual Semantic Parsing: From Images to Abstract Meaning Representation
M. A. Abdelsalam
Zhan Shi
Federico Fancellu
Kalliopi Basioti
Dhaivat Bhatt
Vladimir Pavlovic
Afsaneh Fazly
GNN
85
4
0
26 Oct 2022
Compressing And Debiasing Vision-Language Pre-Trained Models for Visual Question Answering
Q. Si
Yuanxin Liu
Zheng Lin
Peng Fu
Weiping Wang
VLM
117
1
0
26 Oct 2022
Multilingual Multimodal Learning with Machine Translated Text
Chen Qiu
Dan Oneaţă
Emanuele Bugliarello
Stella Frank
Desmond Elliott
121
15
0
24 Oct 2022
Dissecting Deep Metric Learning Losses for Image-Text Retrieval
Hong Xuan
Xi Chen
66
2
0
21 Oct 2022
Image-Text Retrieval with Binary and Continuous Label Supervision
Zheng Li
Caili Guo
Zerun Feng
Lei Li
Ying Jin
Yufeng Zhang
VLM
71
4
0
20 Oct 2022
A Survey of Computer Vision Technologies In Urban and Controlled-environment Agriculture
Jiayun Luo
Boyang Albert Li
Cyril Leung
144
15
0
20 Oct 2022
Visual Spatial Description: Controlled Spatial-Oriented Image-to-Text Generation
Yu Zhao
Jianguo Wei
Zhichao Lin
Yueheng Sun
Meishan Zhang
Hao Fei
68
16
0
20 Oct 2022
Grounded Video Situation Recognition
Zeeshan Khan
C. V. Jawahar
Makarand Tapaswi
92
14
0
19 Oct 2022
CPL: Counterfactual Prompt Learning for Vision and Language Models
Xuehai He
Diji Yang
Weixi Feng
Tsu-Jui Fu
Arjun Reddy Akula
Varun Jampani
P. Narayana
Sugato Basu
William Yang Wang
Xinze Wang
VPVLM, VLM
94
15
0
19 Oct 2022
Probing Cross-modal Semantics Alignment Capability from the Textual Perspective
Zheng Ma
Shi Zong
Mianzhi Pan
Jianbing Zhang
Shujian Huang
Xinyu Dai
Jiajun Chen
54
4
0
18 Oct 2022
CNT (Conditioning on Noisy Targets): A new Algorithm for Leveraging Top-Down Feedback
Alexia Jolicoeur-Martineau
Alex Lamb
Vikas Verma
Aniket Didolkar
NoLa
30
0
0
18 Oct 2022
Cross-modal Semantic Enhanced Interaction for Image-Sentence Retrieval
Xuri Ge
Fuhai Chen
Songpei Xu
Fuxiang Tao
J. Jose
57
26
0
17 Oct 2022
Novel 3D Scene Understanding Applications From Recurrence in a Single Image
Shimian Zhang
Skanda Bharadwaj
Keaton Kraiger
Yashasvi Asthana
Hong Zhang
R. Collins
Yanxi Liu
121
1
0
14 Oct 2022
Hybrid Reinforced Medical Report Generation with M-Linear Attention and Repetition Penalty
Wenting Xu
Zhenghua Xu
Junyang Chen
Chang Qi
Thomas Lukasiewicz
MedIm
62
8
0
14 Oct 2022
EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning
Tiannan Wang
Wangchunshu Zhou
Yan Zeng
Xinsong Zhang
VLM
82
44
0
14 Oct 2022
Plausible May Not Be Faithful: Probing Object Hallucination in Vision-Language Pre-training
Wenliang Dai
Zihan Liu
Ziwei Ji
Dan Su
Pascale Fung
MLLM, VLM
86
67
0
14 Oct 2022
One does not fit all! On the Complementarity of Vision Encoders for Vision and Language Tasks
Gregor Geigle
Chen Cecilia Liu
Jonas Pfeiffer
Iryna Gurevych
VLM
50
1
0
12 Oct 2022
Learning by Asking Questions for Knowledge-based Novel Object Recognition
Kohei Uehara
Tatsuya Harada
24
1
0
12 Oct 2022
Generating image captions with external encyclopedic knowledge
S. Nikiforova
Tejaswini Deoskar
Denis Paperno
Yoad Winter
72
2
0
10 Oct 2022
Improving Visual-Semantic Embeddings by Learning Semantically-Enhanced Hard Negatives for Cross-modal Information Retrieval
Yan Gong
Georgina Cosma
71
11
0
10 Oct 2022
Towards Robust Visual Question Answering: Making the Most of Biased Samples via Contrastive Learning
Q. Si
Yuanxin Liu
Fandong Meng
Zheng Lin
Peng Fu
Yanan Cao
Weiping Wang
Jie Zhou
88
24
0
10 Oct 2022
Learning Fine-Grained Visual Understanding for Video Question Answering via Decoupling Spatial-Temporal Modeling
Hsin-Ying Lee
Hung-Ting Su
Bing-Chen Tsai
Tsung-Han Wu
Jia-Fong Yeh
Winston H. Hsu
86
2
0
08 Oct 2022
Contextual Modeling for 3D Dense Captioning on Point Clouds
Yufeng Zhong
Longdao Xu
Jiebo Luo
Lin Ma
85
15
0
08 Oct 2022
Video Referring Expression Comprehension via Transformer with Content-aware Query
Ji Jiang
Meng Cao
Tengtao Song
Yuexian Zou
83
5
0
06 Oct 2022
VLSNR:Vision-Linguistics Coordination Time Sequence-aware News Recommendation
Songhao Han
Wei-Ping Huang
Xiaotian Luan (Beihang University)
AI4TS
74
3
0
06 Oct 2022
AOE-Net: Entities Interactions Modeling with Adaptive Attention Mechanism for Temporal Action Proposals Generation
Khoa T. Vo
Sang Truong
Kashu Yamazaki
Bhiksha Raj
Minh-Triet Tran
Ngan Le
156
30
0
05 Oct 2022
Vision+X: A Survey on Multimodal Learning in the Light of Data
Ye Zhu
Yuehua Wu
N. Sebe
Yan Yan
105
19
0
05 Oct 2022
Learning to Collocate Visual-Linguistic Neural Modules for Image Captioning
Xu Yang
Hanwang Zhang
Chongyang Gao
Jianfei Cai
MLLM
81
10
0
04 Oct 2022
Extending Compositional Attention Networks for Social Reasoning in Videos
Christina Sartzetaki
Georgios Paraskevopoulos
Alexandros Potamianos
LRM
43
3
0
03 Oct 2022
A Dual-Attention Learning Network with Word and Sentence Embedding for Medical Visual Question Answering
Xiaofei Huang
Hongfang Gong
MedIm
106
14
0
01 Oct 2022
Task Formulation Matters When Learning Continually: A Case Study in Visual Question Answering
Mavina Nikandrou
Lu Yu
Alessandro Suglia
Ioannis Konstas
Verena Rieser
OOD
76
5
0
30 Sep 2022
SmallCap: Lightweight Image Captioning Prompted with Retrieval Augmentation
R. Ramos
Bruno Martins
Desmond Elliott
Yova Kementchedjhieva
VLM
89
89
0
30 Sep 2022
ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training
Bin Shan
Weichong Yin
Yu Sun
Hao Tian
Hua Wu
Haifeng Wang
VLM
75
19
0
30 Sep 2022
Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding
Fengyuan Shi
Ruopeng Gao
Weilin Huang
Limin Wang
105
28
0
28 Sep 2022
Unified Loss of Pair Similarity Optimization for Vision-Language Retrieval
Zheng Li
Caili Guo
Xin Eric Wang
Zerun Feng
Lei Li
Zhongtian Du
VLM
70
2
0
28 Sep 2022
A Survey on Graph Neural Networks and Graph Transformers in Computer Vision: A Task-Oriented Perspective
Chaoqi Chen
Yushuang Wu
Qiyuan Dai
Hong-Yu Zhou
Mutian Xu
Sibei Yang
Xiaoguang Han
Yizhou Yu
ViT, MedIm, AI4CE
137
80
0
27 Sep 2022
Word to Sentence Visual Semantic Similarity for Caption Generation: Lessons Learned
Ahmed Sabir
119
0
0
26 Sep 2022