1707.07998
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
25 July 2017
Peter Anderson
Xiaodong He
Chris Buehler
Damien Teney
Mark Johnson
Stephen Gould
Lei Zhang
AIMat
Papers citing
"Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering"
50 / 1,868 papers shown
Medical Vision-Language Pre-Training for Brain Abnormalities
Masoud Monajatipoor
Zi-Yi Dou
Aichi Chien
Nanyun Peng
Kai-Wei Chang
VLM
103
0
0
27 Apr 2024
Exploring the Distinctiveness and Fidelity of the Descriptions Generated by Large Vision-Language Models
Yuhang Huang
Zihan Wu
Chongyang Gao
Jiawei Peng
Xu Yang
73
2
0
26 Apr 2024
Learning text-to-video retrieval from image captioning
Lucas Ventura
Cordelia Schmid
Gül Varol
3DV
71
3
0
26 Apr 2024
3SHNet: Boosting Image-Sentence Retrieval via Visual Semantic-Spatial Self-Highlighting
Xuri Ge
Songpei Xu
Fuhai Chen
Jie Wang
Guoxin Wang
Shan An
Joemon M. Jose
3DPC
110
12
0
26 Apr 2024
Energy-Latency Manipulation of Multi-modal Large Language Models via Verbose Samples
Kuofeng Gao
Jindong Gu
Yang Bai
Shu-Tao Xia
Philip Torr
Wei Liu
Zhifeng Li
132
13
0
25 Apr 2024
Self-Bootstrapped Visual-Language Model for Knowledge Selection and Question Answering
Dongze Hao
Qunbo Wang
Longteng Guo
Jie Jiang
Jing Liu
65
1
0
22 Apr 2024
Sentiment-oriented Transformer-based Variational Autoencoder Network for Live Video Commenting
Fengyi Fu
Shancheng Fang
Weidong Chen
Zhendong Mao
ViT
VGen
56
4
0
19 Apr 2024
Resilience through Scene Context in Visual Referring Expression Generation
Simeon Junker
Sina Zarrieß
49
1
0
18 Apr 2024
Dynamic Self-adaptive Multiscale Distillation from Pre-trained Multimodal Large Model for Efficient Cross-modal Representation Learning
Zhengyang Liang
Meiyu Liang
Wei Huang
Yawen Li
Zhe Xue
74
1
0
16 Apr 2024
Find The Gap: Knowledge Base Reasoning For Visual Question Answering
Elham J. Barezi
Parisa Kordjamshidi
58
1
0
16 Apr 2024
ViTextVQA: A Large-Scale Visual Question Answering Dataset for Evaluating Vietnamese Text Comprehension in Images
Quan Van Nguyen
Dan Quang Tran
Huy Quang Pham
Thang Kien-Bao Nguyen
Nghia Hieu Nguyen
Kiet Van Nguyen
Ngan Luu-Thuy Nguyen
CoGe
170
5
0
16 Apr 2024
From Data Deluge to Data Curation: A Filtering-WoRA Paradigm for Efficient Text-based Person Search
Jintao Sun
Zhedong Zheng
Gangyi Ding
122
8
0
16 Apr 2024
'Eyes of a Hawk and Ears of a Fox': Part Prototype Network for Generalized Zero-Shot Learning
Joshua Forster Feinglass
Jayaraman J. Thiagarajan
Rushil Anirudh
T. S. Jayram
Yezhou Yang
VLM
54
0
0
12 Apr 2024
Improving Continuous Sign Language Recognition with Adapted Image Models
Lianyu Hu
Tongkai Shi
Liqing Gao
Zekang Liu
Wei Feng
VLM
86
5
0
12 Apr 2024
Text Data-Centric Image Captioning with Interactive Prompts
Yiyu Wang
Hao Luo
Jungang Xu
Yingfei Sun
Fan Wang
VLM
76
0
0
28 Mar 2024
Beyond Embeddings: The Promise of Visual Table in Visual Reasoning
Yiwu Zhong
Zi-Yuan Hu
Michael R. Lyu
Liwei Wang
57
1
0
27 Mar 2024
Semi-Supervised Image Captioning Considering Wasserstein Graph Matching
Yang Yang
94
0
0
26 Mar 2024
Image Captioning in news report scenario
Tianrui Liu
Qi Cai
Changxin Xu
Bo Hong
Jize Xiong
Yuxin Qiao
Tsungwei Yang
83
14
0
24 Mar 2024
Temporal-Spatial Object Relations Modeling for Vision-and-Language Navigation
Bowen Huang
Yanwei Zheng
Chuanlin Lan
Xinpeng Zhao
Yifei Zou
Dongxiao Yu
104
0
0
23 Mar 2024
Can 3D Vision-Language Models Truly Understand Natural Language?
Weipeng Deng
Jihan Yang
Runyu Ding
Jiahui Liu
Yijiang Li
Xiaojuan Qi
Edith C.H. Ngai
116
6
0
21 Mar 2024
HyperFusion: A Hypernetwork Approach to Multimodal Integration of Tabular and Medical Imaging Data for Predictive Modeling
Daniel Duenias
Brennan Nichyporuk
Tal Arbel
Tammy Riklin-Raviv
95
7
0
20 Mar 2024
Hierarchical Spatial Proximity Reasoning for Vision-and-Language Navigation
Ming Xu
Zilong Xie
100
2
0
18 Mar 2024
Knowledge Condensation and Reasoning for Knowledge-based VQA
Dongze Hao
Jian Jia
Longteng Guo
Qunbo Wang
Te Yang
...
Yanhua Cheng
Bo Wang
Quan Chen
Han Li
Jing Liu
74
1
0
15 Mar 2024
Adversarial Training with OCR Modality Perturbation for Scene-Text Visual Question Answering
Zhixuan Shen
Haonan Luo
Sijia Li
Tianrui Li
68
0
0
14 Mar 2024
A Comprehensive Survey of 3D Dense Captioning: Localizing and Describing Objects in 3D Scenes
Ting Yu
Xiaojun Lin
Shuhui Wang
Weiguo Sheng
Qingming Huang
Jun-chen Yu
3DV
88
10
0
12 Mar 2024
Enhancing Image Caption Generation Using Reinforcement Learning with Human Feedback
L. AdarshN
V. ArunP
L. AravindhN
39
3
0
11 Mar 2024
How to Understand Named Entities: Using Common Sense for News Captioning
Ning Xu
Yanhui Wang
Tingting Zhang
Hongshuo Tian
Mohan Kankanhalli
An-An Liu
63
0
0
11 Mar 2024
HistGen: Histopathology Report Generation via Local-Global Feature Encoding and Cross-modal Context Interaction
Zhengrui Guo
Jiabo Ma
Ying Xu
Yihui Wang
Liansheng Wang
Hao Chen
110
22
0
08 Mar 2024
MeaCap: Memory-Augmented Zero-shot Image Captioning
Zequn Zeng
Yan Xie
Hao Zhang
Chiyu Chen
Zhengjue Wang
Boli Chen
VLM
86
15
0
06 Mar 2024
Causality-based Cross-Modal Representation Learning for Vision-and-Language Navigation
Liuyi Wang
Zongtao He
Ronghao Dang
Huiyi Chen
Chengju Liu
Qi Chen
94
1
0
06 Mar 2024
VEglue: Testing Visual Entailment Systems via Object-Aligned Joint Erasing
Zhiyuan Chang
Mingyang Li
Junjie Wang
Cheng Li
Qing Wang
58
0
0
05 Mar 2024
Zero-shot Generalizable Incremental Learning for Vision-Language Object Detection
Jieren Deng
Haojian Zhang
Kun Ding
Jianhua Hu
Xingxuan Zhang
Yunkuan Wang
VLM
ObjD
167
7
0
04 Mar 2024
Navigating Hallucinations for Reasoning of Unintentional Activities
Shresth Grover
Vibhav Vineet
Yogesh S Rawat
LRM
83
1
0
29 Feb 2024
VIXEN: Visual Text Comparison Network for Image Difference Captioning
Alexander Black
Jing Shi
Yifei Fai
Tu Bui
John Collomosse
72
5
0
29 Feb 2024
Polos: Multimodal Metric Learning from Human Feedback for Image Captioning
Yuiga Wada
Kanta Kaneda
Daichi Saito
Komei Sugiura
89
30
0
28 Feb 2024
Measuring Vision-Language STEM Skills of Neural Models
Jianhao Shen
Ye Yuan
Srbuhi Mirzoyan
Ming Zhang
Chenguang Wang
VLM
117
12
0
27 Feb 2024
Self-Supervised Interpretable End-to-End Learning via Latent Functional Modularity
Hyunki Seong
David Hyunchul Shim
61
1
0
21 Feb 2024
MORE-3S: Multimodal-based Offline Reinforcement Learning with Shared Semantic Spaces
Tianyu Zheng
Ge Zhang
Xingwei Qu
Ming Kuang
Stephen W. Huang
Zhaofeng He
OffRL
121
1
0
20 Feb 2024
Cobra Effect in Reference-Free Image Captioning Metrics
Zheng Ma
Changxin Wang
Yawen Ouyang
Fei Zhao
Jianbing Zhang
Shujian Huang
Jiajun Chen
90
2
0
18 Feb 2024
LLMs as Bridges: Reformulating Grounded Multimodal Named Entity Recognition
Jinyuan Li
Han Li
Di Sun
Jiahao Wang
Wenkun Zhang
Zan Wang
Gang Pan
101
7
0
15 Feb 2024
Multimodal Rationales for Explainable Visual Question Answering
Kun Li
G. Vosselman
Michael Ying Yang
130
2
0
06 Feb 2024
Instruction Makes a Difference
Tosin Adewumi
Nudrat Habib
Lama Alkhaled
Elisa Barney
VLM
MLLM
69
1
0
01 Feb 2024
Improving Data Augmentation for Robust Visual Question Answering with Effective Curriculum Learning
Yuhang Zheng
Zhen Wang
Long Chen
59
2
0
28 Jan 2024
Multi-modal News Understanding with Professionally Labelled Videos (ReutersViLNews)
Shih-Han Chou
Matthew Kowal
Yasmin Niknam
Diana Moyano
Shayaan Mehdi
...
Cheng Zhang
Ian Knopke
S. Kocak
Leonid Sigal
Yalda Mohsenzadeh
137
1
0
23 Jan 2024
Inducing High Energy-Latency of Large Vision-Language Models with Verbose Images
Kuofeng Gao
Yang Bai
Jindong Gu
Shu-Tao Xia
Philip Torr
Zhifeng Li
Wei Liu
VLM
86
47
0
20 Jan 2024
Advancing Large Multi-modal Models with Explicit Chain-of-Reasoning and Visual Question Generation
Kohei Uehara
Nabarun Goswami
Hanqin Wang
Toshiaki Baba
Kohtaro Tanaka
...
Takagi Naoya
Ryo Umagami
Yingyi Wen
Tanachai Anakewat
Tatsuya Harada
LRM
65
3
0
18 Jan 2024
KTVIC: A Vietnamese Image Captioning Dataset on the Life Domain
Anh-Cuong Pham
Van-Quang Nguyen
Thi-Hong Vuong
Quang-Thuy Ha
53
1
0
16 Jan 2024
DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent)
Zongxin Yang
Guikun Chen
Xiaodi Li
Wenguan Wang
Yi Yang
LM&Ro
LLMAG
177
41
0
16 Jan 2024
Uncovering the Full Potential of Visual Grounding Methods in VQA
Daniel Reich
Tanja Schultz
102
5
0
15 Jan 2024
MapGPT: Map-Guided Prompting with Adaptive Path Planning for Vision-and-Language Navigation
Jiaqi Chen
Bingqian Lin
Ran Xu
Zhenhua Chai
Xiaodan Liang
Kwan-Yee K. Wong
LM&Ro
LLMAG
82
32
0
14 Jan 2024