Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

25 July 2017

Lei Zhang

Papers citing "Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering"

50 / 1,868 papers shown

Title
CapWAP: Captioning with a Purpose Adam Fisch Kenton Lee Ming-Wei Chang J. Clark Regina Barzilay 53 11 0 09 Nov 2020
Imagining Grounded Conceptual Representations from Perceptual Information in Situated Guessing Games Alessandro Suglia Antonio Vergari Ioannis Konstas Yonatan Bisk E. Bastianelli Andrea Vanzo Oliver Lemon OCL 43 10 0 05 Nov 2020
DTGAN: Dual Attention Generative Adversarial Networks for Text-to-Image Generation Zhenxing Zhang Lambert Schomaker GAN 67 35 0 05 Nov 2020
Utilizing Every Image Object for Semi-supervised Phrase Grounding Haidong Zhu Arka Sadhu Zhao-Heng Zheng Ram Nevatia ObjD 66 7 0 05 Nov 2020
An Improved Attention for Visual Question Answering Tanzila Rahman Shih-Han Chou Leonid Sigal Giuseppe Carenini 55 45 0 04 Nov 2020
Cross-Media Keyphrase Prediction: A Unified Framework with Multi-Modality Multi-Head Attention and Image Wordings Yue Wang Jing Li Michael R. Lyu Irwin King 75 16 0 03 Nov 2020
Dual Attention on Pyramid Feature Maps for Image Captioning Litao Yu Jian Zhang Qiang Wu 108 50 0 02 Nov 2020
Diverse Image Captioning with Context-Object Split Latent Spaces Shweta Mahajan Stefan Roth 64 42 0 02 Nov 2020
Boost Image Captioning with Knowledge Reasoning Feicheng Huang Zhixin Li Haiyang Wei Canlong Zhang Huifang Ma 38 25 0 02 Nov 2020
Exploring Dynamic Context for Multi-path Trajectory Prediction Hao Cheng Wentong Liao Xuejiao Tang M. Yang Monika Sester Bodo Rosenhahn 105 33 0 30 Oct 2020
Generating Radiology Reports via Memory-driven Transformer Zhihong Chen Yan Song Tsung-Hui Chang Xiang Wan MedIm 76 486 0 30 Oct 2020
Loss re-scaling VQA: Revisiting the Language Prior Problem from a Class-imbalance View Yangyang Guo Liqiang Nie Zhiyong Cheng Q. Tian Min Zhang 116 70 0 30 Oct 2020
MMFT-BERT: Multimodal Fusion Transformer with BERT Encodings for Visual Question Answering Aisha Urooj Khan Amir Mazaheri N. Lobo M. Shah 97 57 0 27 Oct 2020
Learning Multi-Agent Coordination for Enhancing Target Coverage in Directional Sensor Networks Jing Xu Fangwei Zhong Yizhou Wang 83 50 0 25 Oct 2020
RUArt: A Novel Text-Centered Solution for Text-Based Visual Question Answering Zanxia Jin Heran Wu Chun Yang Fang Zhou Jingyan Qin Lei Xiao Xu-Cheng Yin 88 31 0 24 Oct 2020
Beyond VQA: Generating Multi-word Answer and Rationale to Visual Questions Radhika Dua Sai Srinivas Kancheti V. Balasubramanian LRM 88 22 0 24 Oct 2020
Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions Liunian Harold Li Haoxuan You Zhecan Wang Alireza Zareian Shih-Fu Chang Kai-Wei Chang SSL VLM 101 12 0 24 Oct 2020
Can images help recognize entities? A study of the role of images for Multimodal NER Shuguang Chen Gustavo Aguilar Leonardo Neves Thamar Solorio EgoV 90 37 0 23 Oct 2020
Show and Speak: Directly Synthesize Spoken Description of Images Xinsheng Wang Siyuan Feng Jihua Zhu M. Hasegawa-Johnson O. Scharenborg 152 4 0 23 Oct 2020
Beyond the Deep Metric Learning: Enhance the Cross-Modal Matching with Adversarial Discriminative Domain Regularization Li Ren Keqin Li Liqiang Wang K. Hua 54 4 0 23 Oct 2020
Language-Conditioned Imitation Learning for Robot Manipulation Tasks Simon Stepputtis Joseph Campbell Mariano Phielipp Stefan Lee Chitta Baral H. B. Amor LM&Ro 200 205 0 22 Oct 2020
Learning Dual Semantic Relations with Graph Attention for Image-Text Matching Keyu Wen Xiaodong Gu Qingrong Cheng 76 97 0 22 Oct 2020
Removing Bias in Multi-modal Classifiers: Regularization by Maximizing Functional Entropies Itai Gat Idan Schwartz Alex Schwing Tamir Hazan 106 92 0 21 Oct 2020
Bayesian Attention Modules Xinjie Fan Shujian Zhang Bo Chen Mingyuan Zhou 183 62 0 20 Oct 2020
Multimodal Research in Vision and Language: A Review of Current and Emerging Trends Shagun Uppal Sarthak Bhagat Devamanyu Hazarika Navonil Majumdar Soujanya Poria Roger Zimmermann Amir Zadeh 101 6 0 19 Oct 2020
Image Captioning with Visual Object Representations Grounded in the Textual Modality Dušan Variš Katsuhito Sudoh Satoshi Nakamura 35 1 0 19 Oct 2020
Language and Visual Entity Relationship Graph for Agent Navigation Yicong Hong Cristian Rodriguez-Opazo Yuankai Qi Qi Wu Stephen Gould LM&Ro 226 135 0 19 Oct 2020
Unsupervised Foveal Vision Neural Networks with Top-Down Attention Ryan Burt Nina N. Thigpen A. Keil José C. Príncipe 56 2 0 18 Oct 2020
Hierarchical Conditional Relation Networks for Multimodal Video Question Answering T. Le Vuong Le Svetha Venkatesh T. Tran BDL 138 23 0 18 Oct 2020
Answer-checking in Context: A Multi-modal Fully Attention Network for Visual Question Answering Hantao Huang Tao Han Wei Han D. Yap Cheng-Ming Chiang 28 4 0 17 Oct 2020
New Ideas and Trends in Deep Multimodal Content Understanding: A Review Wei Chen Weiping Wang Li Liu M. Lew VLM 169 33 0 16 Oct 2020
Natural Language Rationales with Full-Stack Visual Reasoning: From Pixels to Semantic Frames to Commonsense Graphs Ana Marasović Chandra Bhagavatula J. S. Park Ronan Le Bras Noah A. Smith Yejin Choi ReLM LRM 99 62 0 15 Oct 2020
The Benefit of Distraction: Denoising Remote Vitals Measurements using Inverse Attention E. Nowara Daniel J. McDuff Ashok Veeraraghavan 53 13 0 14 Oct 2020
Does my multimodal model learn cross-modal interactions? It's harder to tell than you might think! Jack Hessel Lillian Lee 108 76 0 13 Oct 2020
DORi: Discovering Object Relationship for Moment Localization of a Natural-Language Query in Video Cristian Rodriguez-Opazo Edison Marrese-Taylor Basura Fernando Hongdong Li Stephen Gould 192 10 0 13 Oct 2020
Contrast and Classify: Training Robust VQA Models Yash Kant A. Moudgil Dhruv Batra Devi Parikh Harsh Agrawal 55 5 0 13 Oct 2020
TSPNet: Hierarchical Feature Learning via Temporal Semantic Pyramid for Sign Language Translation Dongxu Li Chenchen Xu Xin Yu Kaihao Zhang Ben Swift H. Suominen Hongdong Li SLR 60 124 0 12 Oct 2020
MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding Qinxin Wang Hao Tan Sheng Shen Michael W. Mahoney Z. Yao ObjD 147 11 0 12 Oct 2020
Interpretable Neural Computation for Real-World Compositional Visual Question Answering Ruixue Tang Chao Ma CoGe 26 2 0 10 Oct 2020
Dense Relational Image Captioning via Multi-task Triple-Stream Networks Dong-Jin Kim Tae-Hyun Oh Jinsoo Choi In So Kweon 115 27 0 08 Oct 2020
Visual News: Benchmark and Challenges in News Image Captioning Fuxiao Liu Yinghan Wang Tianlu Wang Vicente Ordonez VLM 86 116 0 08 Oct 2020
Universal Weighting Metric Learning for Cross-Modal Matching Jiwei Wei Xing Xu Yang Yang Yanli Ji Zheng Wang Heng Tao Shen 70 89 0 07 Oct 2020
Vision Skills Needed to Answer Visual Questions Xiaoyu Zeng Yanan Wang Tai-Yin Chiu Nilavra Bhattacharya Danna Gurari 66 18 0 07 Oct 2020
Learning to Represent Image and Text with Denotation Graph Bowen Zhang Hexiang Hu Vihan Jain Eugene Ie Fei Sha 78 22 0 06 Oct 2020
Fine-Grained Grounding for Multimodal Speech Recognition Tejas Srinivasan Ramon Sanabria Florian Metze Desmond Elliott 76 11 0 05 Oct 2020
Attention Guided Semantic Relationship Parsing for Visual Question Answering M. Farazi Salman Khan Nick Barnes 43 2 0 05 Oct 2020
UNISON: Unpaired Cross-lingual Image Captioning Jiahui Gao Yi Zhou Philip L. H. Yu Shafiq Joty Jiuxiang Gu 82 17 0 03 Oct 2020
Taking Modality-free Human Identification as Zero-shot Learning Zhizhe Liu Xingxing Zhang Zhenfeng Zhu Shuai Zheng Yao Zhao Jian Cheng 56 4 0 02 Oct 2020
ISAAQ -- Mastering Textbook Questions with Pre-trained Transformers and Bottom-Up and Top-Down Attention José Manuél Gómez-Pérez Raúl Ortega 61 24 0 01 Oct 2020
Answer-Driven Visual State Estimator for Goal-Oriented Visual Dialogue Zipeng Xu Fangxiang Feng Xiaojie Wang Yushu Yang Huixing Jiang Zhongyuan Ouyang 47 7 0 01 Oct 2020