1707.07998
Cited By
v1
v2
v3 (latest)
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
25 July 2017
Peter Anderson
Xiaodong He
Chris Buehler
Damien Teney
Mark Johnson
Stephen Gould
Lei Zhang
AIMat
Papers citing
"Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering"
50 / 1,868 papers shown
Title
Few-shot Action Recognition with Captioning Foundation Models
Xiang Wang
Shiwei Zhang
Hangjie Yuan
Yingya Zhang
Changxin Gao
Deli Zhao
Nong Sang
VLM
126
7
0
16 Oct 2023
Bounding and Filling: A Fast and Flexible Framework for Image Captioning
Zheng Ma
Changxin Wang
Bo Huang
Zi-Yue Zhu
Jianbing Zhang
60
1
0
15 Oct 2023
Question Answering for Electronic Health Records: A Scoping Review of datasets and models
Jayetri Bardhan
Kirk Roberts
Daisy Zhe Wang
79
2
0
12 Oct 2023
A Comparative Study of Pre-trained CNNs and GRU-Based Attention for Image Caption Generation
Rashid Khan
Bingding Huang
Haseeb Hassan
Asim Zaman
Z. Ye
41
2
0
11 Oct 2023
Controllable Chest X-Ray Report Generation from Longitudinal Representations
Francesco Dalla Serra
Chaoyang Wang
Fani Deligianni
Jeffrey Stephen Dalton
Alison Q. O'Neil
MedIm
111
16
0
09 Oct 2023
C^2M-DoT: Cross-modal consistent multi-view medical report generation with domain transfer network
Ruizhi Wang
Xiang-Fei Wang
Jie Zhou
Thomas Lukasiewicz
Zhenghua Xu
71
1
0
09 Oct 2023
Module-wise Adaptive Distillation for Multimodality Foundation Models
Chen Liang
Jiahui Yu
Ming-Hsuan Yang
Matthew A. Brown
Huayu Chen
Tuo Zhao
Boqing Gong
Tianyi Zhou
104
10
0
06 Oct 2023
Constructing Image-Text Pair Dataset from Books
Yamato Okamoto
Haruto Toyonaga
Yoshihisa Ijiri
Hirokatsu Kataoka
79
3
0
03 Oct 2023
Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval
Hao Li
Marie-Jeanne Lesot
Lianli Gao
Xiaosu Zhu
Christophe Marsala
EDL
78
15
0
29 Sep 2023
ELIP: Efficient Language-Image Pre-training with Fewer Vision Tokens
Yangyang Guo
Haoyu Zhang
Yongkang Wong
Liqiang Nie
Mohan Kankanhalli
VLM
69
3
0
28 Sep 2023
Align before Search: Aligning Ads Image to Text for Accurate Cross-Modal Sponsored Search
Yuanmin Tang
Daling Wang
Keke Gai
Wenfang Wu
Yifei Zhang
Gang Xiong
Qi Wu
73
4
0
28 Sep 2023
Targeted Image Data Augmentation Increases Basic Skills Captioning Robustness
Valentin Barriere
Felipe del Rio
Andres Carvallo De Ferari
Carlos Aspillaga
Eugenio Herrera-Berg
Cristian Buc Calderon
DiffM
63
0
0
27 Sep 2023
Object-Centric Open-Vocabulary Image-Retrieval with Aggregated Features
Hila Levi
Guy Heller
Dan Levi
Ethan Fetaya
OCL
VLM
69
4
0
26 Sep 2023
BLIP-Adapter: Parameter-Efficient Transfer Learning for Mobile Screenshot Captioning
Ching-Yu Chiang
I-Hua Chang
Shih-Wei Liao
83
1
0
26 Sep 2023
A Survey on Image-text Multimodal Models
Ruifeng Guo
Jingxuan Wei
Linzhuang Sun
Khai-Nguyen Nguyen
Guiyong Chang
Dawei Liu
Sibo Zhang
Zhengbing Yao
Mingjun Xu
Liping Bu
VLM
128
7
0
23 Sep 2023
Towards Answering Health-related Questions from Medical Videos: Datasets and Approaches
Deepak Gupta
Kush Attal
Dina Demner-Fushman
LM&MA
49
1
0
21 Sep 2023
A Novel Method of Fuzzy Topic Modeling based on Transformer Processing
Ching-Hsun Tseng
Shin-Jye Lee
Po-Wei Cheng
Chien Lee
Chih-Chieh Hung
31
0
0
18 Sep 2023
Syntax Tree Constrained Graph Network for Visual Question Answering
Xiangrui Su
Qi Zhang
Chongyang Shi
Jiachang Liu
Liang Hu
GNN
NAI
58
3
0
17 Sep 2023
Dynamic Visual Semantic Sub-Embeddings and Fast Re-Ranking
Wenzhang Wei
Zhipeng Gui
Changguang Wu
Anqi Zhao
D. Peng
Huayi Wu
77
0
0
15 Sep 2023
Improving Multimodal Classification of Social Media Posts by Leveraging Image-Text Auxiliary Tasks
Danae Sánchez Villegas
Daniel Preoţiuc-Pietro
Nikolaos Aletras
64
3
0
14 Sep 2023
Rank2Tell: A Multimodal Driving Dataset for Joint Importance Ranking and Reasoning
Enna Sachdeva
Nakul Agarwal
Suhas Chundi
Sean Roelofs
Jiachen Li
Mykel Kochenderfer
Chiho Choi
Behzad Dariush
92
51
0
12 Sep 2023
Beyond Generation: Harnessing Text to Image Models for Object Detection and Segmentation
Yunhao Ge
Lyne Tchapmi
Brian Nlong Zhao
Neel Joshi
Laurent Itti
Vibhav Vineet
DiffM
77
14
0
12 Sep 2023
NExT-GPT: Any-to-Any Multimodal LLM
Shengqiong Wu
Hao Fei
Leigang Qu
Wei Ji
Tat-Seng Chua
MLLM
115
507
0
11 Sep 2023
Prefix-diffusion: A Lightweight Diffusion Model for Diverse Image Captioning
Guisheng Liu
Yi Li
Zhengcong Fei
Haiyan Fu
Xiangyang Luo
Yanqing Guo
VLM
DiffM
83
8
0
10 Sep 2023
Towards Better Multi-modal Keyphrase Generation via Visual Entity Enhancement and Multi-granularity Image Noise Filtering
Yifan Dong
Suhang Wu
Fandong Meng
Jie Zhou
Xiaoli Wang
Jianxin Lin
Jinsong Su
79
4
0
09 Sep 2023
A Multimodal Analysis of Influencer Content on Twitter
Danae Sánchez Villegas
Catalina Goanta
Nikolaos Aletras
115
6
0
06 Sep 2023
Vote2Cap-DETR++: Decoupling Localization and Describing for End-to-End 3D Dense Captioning
Sijin Chen
Erik Cambria
Mingsheng Li
Xin Chen
Peng Guo
Yinjie Lei
Gang Yu
Taihao Li
Tao Chen
66
23
0
06 Sep 2023
A Joint Study of Phrase Grounding and Task Performance in Vision and Language Models
Noriyuki Kojima
Hadar Averbuch-Elor
Yoav Artzi
70
2
0
06 Sep 2023
ATM: Action Temporality Modeling for Video Question Answering
Junwen Chen
Jie Zhu
Yu Kong
60
1
0
05 Sep 2023
S3C: Semi-Supervised VQA Natural Language Explanation via Self-Critical Learning
Wei Suo
Mengyang Sun
Weisong Liu
Yi-Meng Gao
Peifeng Wang
Yanning Zhang
Qi Wu
LRM
68
7
0
05 Sep 2023
Towards Addressing the Misalignment of Object Proposal Evaluation for Vision-Language Tasks via Semantic Grounding
Joshua Forster Feinglass
Yezhou Yang
51
2
0
01 Sep 2023
Distraction-free Embeddings for Robust VQA
Atharvan Dogra
Deeksha Varshney
Ashwin Kalyan
Ameet Deshpande
Neeraj Kumar
100
0
0
31 Aug 2023
ViLTA: Enhancing Vision-Language Pre-training through Textual Augmentation
Weihan Wang
Zhiyong Yang
Bin Xu
Juanzi Li
Yankui Sun
VLM
96
8
0
31 Aug 2023
Can Prompt Learning Benefit Radiology Report Generation?
Jun Wang
Lixing Zhu
A. Bhalerao
Yulan He
MedIm
86
2
0
30 Aug 2023
Finding-Aware Anatomical Tokens for Chest X-Ray Automated Reporting
Francesco Dalla Serra
Chaoyang Wang
Fani Deligianni
Jeffrey Stephen Dalton
Alison Q. O'Neil
MedIm
56
9
0
30 Aug 2023
Read-only Prompt Optimization for Vision-Language Few-shot Learning
Dongjun Lee
Seokwon Song
Jihee G. Suh
Joonmyeong Choi
S. Lee
Hyunwoo J. Kim
VLM
89
45
0
29 Aug 2023
UniPT: Universal Parallel Tuning for Transfer Learning with Efficient Parameter and Memory
Haiwen Diao
Bo Wan
Yanzhe Zhang
Xuecong Jia
Huchuan Lu
Long Chen
VLM
81
19
0
28 Aug 2023
With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning
Manuele Barraco
Sara Sarto
Marcella Cornia
Lorenzo Baraldi
Rita Cucchiara
VLM
90
20
0
23 Aug 2023
CgT-GAN: CLIP-guided Text GAN for Image Captioning
Jiarui Yu
Haoran Li
Y. Hao
B. Zhu
Tong Xu
Xiangnan He
VLM
CLIP
65
13
0
23 Aug 2023
MusicJam: Visualizing Music Insights via Generated Narrative Illustrations
Chuer Chen
Nan Cao
Jiani Hou
Yi Guo
Yulei Zhang
Yang Shi
DiffM
61
0
0
22 Aug 2023
CiteTracker: Correlating Image and Text for Visual Tracking
Xin Li
Yuqing Huang
Zhenyu He
Yaowei Wang
Huchuan Lu
Ming-Hsuan Yang
99
30
0
22 Aug 2023
ROSGPT_Vision: Commanding Robots Using Only Language Models' Prompts
Bilel Benjdira
Anis Koubaa
Anas M. Ali
LM&Ro
58
4
0
22 Aug 2023
Explore and Tell: Embodied Visual Captioning in 3D Environments
Anwen Hu
Shizhe Chen
Liang Zhang
Qin Jin
LM&Ro
80
2
0
21 Aug 2023
Simple Baselines for Interactive Video Retrieval with Questions and Answers
Kaiqu Liang
Samuel Albanie
74
3
0
21 Aug 2023
Generic Attention-model Explainability by Weighted Relevance Accumulation
Yiming Huang
Ao Jia
Xiaodan Zhang
Jiawei Zhang
46
1
0
20 Aug 2023
Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models
Navid Rajabi
Jana Kosecka
VLM
111
12
0
18 Aug 2023
Artificial-Spiking Hierarchical Networks for Vision-Language Representation Learning
Ye-Ting Chen
Siyu Zhang
Yaoru Sun
Weijian Liang
Haoran Wang
74
1
0
18 Aug 2023
Uni-NLX: Unifying Textual Explanations for Vision and Vision-Language Tasks
Fawaz Sammani
Nikos Deligiannis
48
5
0
17 Aug 2023
Diagnosing Human-object Interaction Detectors
Fangrui Zhu
Yiming Xie
Weidi Xie
Huaizu Jiang
77
8
0
16 Aug 2023
Visually-Aware Context Modeling for News Image Captioning
Tingyu Qu
Tinne Tuytelaars
Marie-Francine Moens
VLM
58
9
0
16 Aug 2023