1707.07998
v3 (latest)
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
25 July 2017
Peter Anderson
Xiaodong He
Chris Buehler
Damien Teney
Mark Johnson
Stephen Gould
Lei Zhang
AIMat
Papers citing
"Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering"
50 / 1,868 papers shown
LaVi: Efficient Large Vision-Language Models via Internal Feature Modulation
Tongtian Yue
Longteng Guo
Yepeng Tang
Zijia Zhao
Xinxin Zhu
Hua Huang
Jing Liu
MLLM
VLM
16
0
0
20 Jun 2025
Multimodal Large Language Models for Medical Report Generation via Customized Prompt Tuning
Chunlei Li
Jingyang Hou
Yilei Shi
Jingliang Hu
Xiao Xiang Zhu
Lichao Mou
LM&MA
28
0
0
18 Jun 2025
ASCD: Attention-Steerable Contrastive Decoding for Reducing Hallucination in MLLM
Yujun Wang
Jinhe Bi
Yunpu Ma
Soeren Pirk
MLLM
39
0
0
17 Jun 2025
Rethinking Explainability in the Era of Multimodal AI
Chirag Agarwal
19
0
0
16 Jun 2025
Efficiency Robustness of Dynamic Deep Learning Systems
Ravishka Rathnasuriya
Tingxi Li
Zexin Xu
Zihe Song
Mirazul Haque
Simin Chen
Wei Yang
AAML
SILM
145
0
0
12 Jun 2025
Vision Generalist Model: A Survey
Ziyi Wang
Yongming Rao
Shuofeng Sun
Xinrun Liu
Yi Wei
...
Zuyan Liu
Yanbo Wang
Hongmin Liu
Jie Zhou
Jiwen Lu
65
0
0
11 Jun 2025
Generating Vision-Language Navigation Instructions Incorporated Fine-Grained Alignment Annotations
Yibo Cui
Liang Xie
Yu Zhao
Jiawei Sun
Erwei Yin
17
0
0
10 Jun 2025
Attention-based transformer models for image captioning across languages: An in-depth survey and evaluation
Israa A. Albadarneh
Bassam Hammo
Omar Al-Kadi
VLM
27
0
0
03 Jun 2025
Light as Deception: GPT-driven Natural Relighting Against Vision-Language Pre-training Models
Ying Yang
Jie Zhang
Xiao Lv
Di Lin
Tao Xiang
Qing Guo
AAML
VLM
38
0
0
30 May 2025
Multi-Sourced Compositional Generalization in Visual Question Answering
Chuanhao Li
Wenbo Ye
Zhen Li
Yuwei Wu
Yunde Jia
CoGe
55
0
0
29 May 2025
Flexible Tool Selection through Low-dimensional Attribute Alignment of Vision and Language
Guangfu Hao
Haojie Wen
Liangxuna Guo
Yang Chen
Yanchao Bi
S. Yu
49
0
0
28 May 2025
MM-Prompt: Cross-Modal Prompt Tuning for Continual Visual Question Answering
Xu Li
Fan Lyu
LRM
15
0
0
26 May 2025
GC-KBVQA: A New Four-Stage Framework for Enhancing Knowledge Based Visual Question Answering Performance
Mohammad Mahdi Moradi
Sudhir Mudur
100
0
0
25 May 2025
Visual Question Answering on Multiple Remote Sensing Image Modalities
Hichem Boussaid
Lucrezia Tosato
F. Weissgerber
Camille Kurtz
Laurent Wendling
Sylvain Lobry
60
0
0
21 May 2025
TDFormer: A Top-Down Attention-Controlled Spiking Transformer
Zizheng Zhu
Yingchao Yu
Zeqi Zheng
Zhaofei Yu
Yaochu Jin
89
0
0
17 May 2025
Disambiguating Reference in Visually Grounded Dialogues through Joint Modeling of Textual and Multimodal Semantic Structures
Shun Inadumi
Nobuhiro Ueda
Koichiro Yoshino
ObjD
72
0
0
16 May 2025
Variational Visual Question Answering
Tobias Jan Wieczorek
Nathalie Daun
Mohammad Emtiyaz Khan
Marcus Rohrbach
OOD
89
0
0
14 May 2025
ViMRHP: A Vietnamese Benchmark Dataset for Multimodal Review Helpfulness Prediction via Human-AI Collaborative Annotation
T. Nguyen
D. Nguyen
Son T. Luu
Kiet Van Nguyen
50
0
0
12 May 2025
Describe Anything in Medical Images
Xi Xiao
Yunbei Zhang
Thanh-Huy Nguyen
Ba Thinh Lam
Janet Wang
...
Xiaobei Wang
Xiao Wang
Hao Xu
Tianming Liu
Min Xu
MedIm
VLM
184
0
0
09 May 2025
Multimodal Graph Representation Learning for Robust Surgical Workflow Recognition with Adversarial Feature Disentanglement
Long Bai
Boyi Ma
Ruohan Wang
Guankun Wang
Beilei Cui
...
Mobarakol Islam
Zhe Min
Jiewen Lai
Nassir Navab
Hongliang Ren
130
0
0
03 May 2025
Enhancing Surgical Documentation through Multimodal Visual-Temporal Transformers and Generative AI
Hugo Georgenthum
Cristian Cosentino
Fabrizio Marozzo
Pietro Liò
MedIm
443
0
0
28 Apr 2025
Tri-FusionNet: Enhancing Image Description Generation with Transformer-based Fusion Network and Dual Attention Mechanism
Lakshita Agarwal
Bindu Verma
ViT
129
0
0
23 Apr 2025
Zero-Shot, But at What Cost? Unveiling the Hidden Overhead of MILS's LLM-CLIP Framework for Image Captioning
Yassir Benhammou
Alessandro Tiberio
Gabriel Trautmann
Suman Kalyan
MLLM
VLM
71
0
0
21 Apr 2025
Hadamard product in deep learning: Introduction, Advances and Challenges
Grigorios G. Chrysos
Yongtao Wu
Razvan Pascanu
Philip Torr
Volkan Cevher
AAML
168
2
0
17 Apr 2025
DART: Disease-aware Image-Text Alignment and Self-correcting Re-alignment for Trustworthy Radiology Report Generation
Sang-Jun Park
Keun-Soo Heo
Dong-Hee Shin
Young-Han Son
Ji-Hye Oh
Tae-Eui Kam
MedIm
53
0
0
16 Apr 2025
Building Trustworthy Multimodal AI: A Review of Fairness, Transparency, and Ethics in Vision-Language Tasks
Mohammad Saleha
Azadeh Tabatabaeib
148
0
0
14 Apr 2025
AeroLite: Tag-Guided Lightweight Generation of Aerial Image Captions
Xing Zi
Tengjun Ni
Xianjing Fan
Xian Tao
Jun Li
Ali Braytee
Mukesh Prasad
55
0
0
13 Apr 2025
Patch Matters: Training-free Fine-grained Image Caption Enhancement via Local Perception
Ruotian Peng
Haiying He
Yake Wei
Yandong Wen
D. Hu
VLM
72
0
0
09 Apr 2025
Feedback-Enhanced Hallucination-Resistant Vision-Language Model for Real-Time Scene Understanding
Zahir Alsulaimawi
41
0
0
07 Apr 2025
Enabling Collaborative Parametric Knowledge Calibration for Retrieval-Augmented Vision Question Answering
Jiaqi Deng
Kaize Shi
Zonghan Wu
Huan Huo
Dingxian Wang
Guandong Xu
40
0
0
05 Apr 2025
QIRL: Boosting Visual Question Answering via Optimized Question-Image Relation Learning
Quanxing Xu
Ling Zhou
Xian Zhong
Feifei Zhang
Rubing Huang
Chia-Wen Lin
63
0
0
04 Apr 2025
Group-based Distinctive Image Captioning with Memory Difference Encoding and Attention
Jiuniu Wang
Wenjia Xu
Qingzhong Wang
Antoni B. Chan
181
0
0
03 Apr 2025
PRISM-0: A Predicate-Rich Scene Graph Generation Framework for Zero-Shot Open-Vocabulary Tasks
Abdelrahman Elskhawy
Mengze Li
Nassir Navab
Benjamin Busam
VLM
95
1
0
01 Apr 2025
Semantic-Spatial Feature Fusion with Dynamic Graph Refinement for Remote Sensing Image Captioning
Maofu Liu
Jiahui Liu
Xiaokang Zhang
105
1
0
30 Mar 2025
Image-to-Text for Medical Reports Using Adaptive Co-Attention and Triple-LSTM Module
Yishen Liu
Shengda Liu
Hudan Pan
MedIm
73
0
0
24 Mar 2025
Aligning Information Capacity Between Vision and Language via Dense-to-Sparse Feature Distillation for Image-Text Matching
Yang Liu
Wentao Feng
Zhuoyao Liu
Shudong Huang
Jiancheng Lv
DiffM
VLM
114
0
0
19 Mar 2025
ChatBEV: A Visual Language Model that Understands BEV Maps
Qingyao Xu
Tian Jin
Guang Chen
Yanfeng Wang
Yize Zhang
70
1
0
18 Mar 2025
Image Captioning Evaluation in the Age of Multimodal LLMs: Challenges and Future Perspectives
Sara Sarto
Marcella Cornia
Rita Cucchiara
84
1
0
18 Mar 2025
CapArena: Benchmarking and Analyzing Detailed Image Captioning in the LLM Era
Kanzhi Cheng
Wenpo Song
Jiaxin Fan
Zheng Ma
Qiushi Sun
Fangzhi Xu
Chenyang Yan
Nuo Chen
Jianbing Zhang
Jiajun Chen
MLLM
VLM
95
3
0
16 Mar 2025
SuperCap: Multi-resolution Superpixel-based Image Captioning
Henry Senior
Luca Rossi
Gregory Slabaugh
Shanxin Yuan
VLM
108
0
0
11 Mar 2025
Measuring directional bias amplification in image captions using predictability
Rahul Nair
Bhanu Tokas
Neel Shah
Hannah Kerner
113
0
0
10 Mar 2025
A Benchmark for Multi-Lingual Vision-Language Learning in Remote Sensing Image Captioning
Qing Zhou
Tao Yang
Junyu Gao
W. Ni
Junzheng Wu
Qi Wang
78
0
0
06 Mar 2025
AC-Lite : A Lightweight Image Captioning Model for Low-Resource Assamese Language
Pankaj Choudhury
Yogesh Aggarwal
Prabhanjan Jadhav
Prithwijit Guha
Sukumar Nandi
199
0
0
03 Mar 2025
OverLoCK: An Overview-first-Look-Closely-next ConvNet with Context-Mixing Dynamic Kernels
Meng Lou
Yizhou Yu
313
2
0
27 Feb 2025
LOVA3: Learning to Visual Question Answering, Asking and Assessment
Henry Hengyuan Zhao
Pan Zhou
Difei Gao
Zechen Bai
Mike Zheng Shou
158
9
0
21 Feb 2025
Predicate Hierarchies Improve Few-Shot State Classification
Emily Jin
Joy Hsu
Jiajun Wu
OffRL
146
0
0
18 Feb 2025
Color Universal Design Neural Network for the Color Vision Deficiencies
Sunyong Seo
Jinho Park
102
0
0
12 Feb 2025
PatentLMM: Large Multimodal Model for Generating Descriptions for Patent Figures
Shivalika Singh
Nakul Sharma
Manish Gupta
Anand Mishra
143
1
0
28 Jan 2025
An Ensemble Model with Attention Based Mechanism for Image Captioning
Israa Al Badarneh
Bassam Hammo
Omar Al-Kadi
198
6
0
28 Jan 2025
Combining Knowledge Graph and LLMs for Enhanced Zero-shot Visual Question Answering
Qian Tao
Xiaoyang Fan
Yong Xu
Xingquan Zhu
Yufei Tang
77
0
0
22 Jan 2025