ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 1707.07998
  4. Cited By
Bottom-Up and Top-Down Attention for Image Captioning and Visual
  Question Answering
v1v2v3 (latest)

Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

25 July 2017
Peter Anderson
Xiaodong He
Chris Buehler
Damien Teney
Mark Johnson
Stephen Gould
Lei Zhang
    AIMat
ArXiv (abs)PDFHTML

Papers citing "Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering"

50 / 1,868 papers shown
Title
Medical Vision-Language Pre-Training for Brain Abnormalities
Medical Vision-Language Pre-Training for Brain Abnormalities
Masoud Monajatipoor
Zi-Yi Dou
Aichi Chien
Nanyun Peng
Kai-Wei Chang
VLM
103
0
0
27 Apr 2024
Exploring the Distinctiveness and Fidelity of the Descriptions Generated
  by Large Vision-Language Models
Exploring the Distinctiveness and Fidelity of the Descriptions Generated by Large Vision-Language Models
Yuhang Huang
Zihan Wu
Chongyang Gao
Jiawei Peng
Xu Yang
73
2
0
26 Apr 2024
Learning text-to-video retrieval from image captioning
Learning text-to-video retrieval from image captioning
Lucas Ventura
Cordelia Schmid
Gül Varol
3DV
71
3
0
26 Apr 2024
3SHNet: Boosting Image-Sentence Retrieval via Visual Semantic-Spatial
  Self-Highlighting
3SHNet: Boosting Image-Sentence Retrieval via Visual Semantic-Spatial Self-Highlighting
Xuri Ge
Songpei Xu
Fuhai Chen
Jie Wang
Guoxin Wang
Shan An
Joemon M. Jose
3DPC
110
12
0
26 Apr 2024
Energy-Latency Manipulation of Multi-modal Large Language Models via
  Verbose Samples
Energy-Latency Manipulation of Multi-modal Large Language Models via Verbose Samples
Kuofeng Gao
Jindong Gu
Yang Bai
Shu-Tao Xia
Philip Torr
Wei Liu
Zhifeng Li
132
13
0
25 Apr 2024
Self-Bootstrapped Visual-Language Model for Knowledge Selection and
  Question Answering
Self-Bootstrapped Visual-Language Model for Knowledge Selection and Question Answering
Dongze Hao
Qunbo Wang
Longteng Guo
Jie Jiang
Jing Liu
65
1
0
22 Apr 2024
Sentiment-oriented Transformer-based Variational Autoencoder Network for
  Live Video Commenting
Sentiment-oriented Transformer-based Variational Autoencoder Network for Live Video Commenting
Fengyi Fu
Shancheng Fang
Weidong Chen
Zhendong Mao
ViTVGen
56
4
0
19 Apr 2024
Resilience through Scene Context in Visual Referring Expression
  Generation
Resilience through Scene Context in Visual Referring Expression Generation
Simeon Junker
Sina Zarrieß
49
1
0
18 Apr 2024
Dynamic Self-adaptive Multiscale Distillation from Pre-trained
  Multimodal Large Model for Efficient Cross-modal Representation Learning
Dynamic Self-adaptive Multiscale Distillation from Pre-trained Multimodal Large Model for Efficient Cross-modal Representation Learning
Zhengyang Liang
Meiyu Liang
Wei Huang
Yawen Li
Zhe Xue
74
1
0
16 Apr 2024
Find The Gap: Knowledge Base Reasoning For Visual Question Answering
Find The Gap: Knowledge Base Reasoning For Visual Question Answering
Elham J. Barezi
Parisa Kordjamshidi
58
1
0
16 Apr 2024
ViTextVQA: A Large-Scale Visual Question Answering Dataset for Evaluating Vietnamese Text Comprehension in Images
ViTextVQA: A Large-Scale Visual Question Answering Dataset for Evaluating Vietnamese Text Comprehension in Images
Quan Van Nguyen
Dan Quang Tran
Huy Quang Pham
Thang Kien-Bao Nguyen
Nghia Hieu Nguyen
Kiet Van Nguyen
Ngan Luu-Thuy Nguyen
CoGe
170
5
0
16 Apr 2024
From Data Deluge to Data Curation: A Filtering-WoRA Paradigm for Efficient Text-based Person Search
From Data Deluge to Data Curation: A Filtering-WoRA Paradigm for Efficient Text-based Person Search
Jintao Sun
Zhedong Zheng
Gangyi Ding
Gangyi Ding
122
8
0
16 Apr 2024
`Eyes of a Hawk and Ears of a Fox': Part Prototype Network for
  Generalized Zero-Shot Learning
`Eyes of a Hawk and Ears of a Fox': Part Prototype Network for Generalized Zero-Shot Learning
Joshua Forster Feinglass
Jayaraman J. Thiagarajan
Rushil Anirudh
T. S. Jayram
Yezhou Yang
VLM
54
0
0
12 Apr 2024
Improving Continuous Sign Language Recognition with Adapted Image Models
Improving Continuous Sign Language Recognition with Adapted Image Models
Lianyu Hu
Tongkai Shi
Liqing Gao
Zekang Liu
Wei Feng
VLM
86
5
0
12 Apr 2024
Text Data-Centric Image Captioning with Interactive Prompts
Text Data-Centric Image Captioning with Interactive Prompts
Yiyu Wang
Hao Luo
Jungang Xu
Yingfei Sun
Fan Wang
VLM
76
0
0
28 Mar 2024
Beyond Embeddings: The Promise of Visual Table in Visual Reasoning
Beyond Embeddings: The Promise of Visual Table in Visual Reasoning
Yiwu Zhong
Zi-Yuan Hu
Michael R. Lyu
Liwei Wang
57
1
0
27 Mar 2024
Semi-Supervised Image Captioning Considering Wasserstein Graph Matching
Semi-Supervised Image Captioning Considering Wasserstein Graph Matching
Yang Yang
94
0
0
26 Mar 2024
Image Captioning in news report scenario
Image Captioning in news report scenario
Tianrui Liu
Qi Cai
Changxin Xu
Bo Hong
Jize Xiong
Yuxin Qiao
Tsungwei Yang
83
14
0
24 Mar 2024
Temporal-Spatial Object Relations Modeling for Vision-and-Language
  Navigation
Temporal-Spatial Object Relations Modeling for Vision-and-Language Navigation
Bowen Huang
Yanwei Zheng
Chuanlin Lan
Xinpeng Zhao
Yifei Zou
Dongxiao Yu
104
0
0
23 Mar 2024
Can 3D Vision-Language Models Truly Understand Natural Language?
Can 3D Vision-Language Models Truly Understand Natural Language?
Weipeng Deng
Jihan Yang
Runyu Ding
Jiahui Liu
Yijiang Li
Xiaojuan Qi
Edith C.H. Ngai
116
6
0
21 Mar 2024
HyperFusion: A Hypernetwork Approach to Multimodal Integration of Tabular and Medical Imaging Data for Predictive Modeling
HyperFusion: A Hypernetwork Approach to Multimodal Integration of Tabular and Medical Imaging Data for Predictive Modeling
Daniel Duenias
Brennan Nichyporuk
Tal Arbel
Tammy Riklin-Raviv
95
7
0
20 Mar 2024
Hierarchical Spatial Proximity Reasoning for Vision-and-Language
  Navigation
Hierarchical Spatial Proximity Reasoning for Vision-and-Language Navigation
Ming Xu
Zilong Xie
100
2
0
18 Mar 2024
Knowledge Condensation and Reasoning for Knowledge-based VQA
Knowledge Condensation and Reasoning for Knowledge-based VQA
Dongze Hao
Jian Jia
Longteng Guo
Qunbo Wang
Te Yang
...
Yanhua Cheng
Bo Wang
Quan Chen
Han Li
Jing Liu
74
1
0
15 Mar 2024
Adversarial Training with OCR Modality Perturbation for Scene-Text
  Visual Question Answering
Adversarial Training with OCR Modality Perturbation for Scene-Text Visual Question Answering
Zhixuan Shen
Haonan Luo
Sijia Li
Tianrui Li
68
0
0
14 Mar 2024
A Comprehensive Survey of 3D Dense Captioning: Localizing and Describing
  Objects in 3D Scenes
A Comprehensive Survey of 3D Dense Captioning: Localizing and Describing Objects in 3D Scenes
Ting Yu
Xiaojun Lin
Shuhui Wang
Weiguo Sheng
Qingming Huang
Jun-chen Yu
3DV
88
10
0
12 Mar 2024
Enhancing Image Caption Generation Using Reinforcement Learning with
  Human Feedback
Enhancing Image Caption Generation Using Reinforcement Learning with Human Feedback
L. AdarshN
V. ArunP
L. AravindhN
39
3
0
11 Mar 2024
How to Understand Named Entities: Using Common Sense for News Captioning
How to Understand Named Entities: Using Common Sense for News Captioning
Ning Xu
Yanhui Wang
Tingting Zhang
Hongshuo Tian
Mohan Kankanhalli
An-An Liu
63
0
0
11 Mar 2024
HistGen: Histopathology Report Generation via Local-Global Feature
  Encoding and Cross-modal Context Interaction
HistGen: Histopathology Report Generation via Local-Global Feature Encoding and Cross-modal Context Interaction
Zhengrui Guo
Jiabo Ma
Ying Xu
Yihui Wang
Liansheng Wang
Hao Chen
110
22
0
08 Mar 2024
MeaCap: Memory-Augmented Zero-shot Image Captioning
MeaCap: Memory-Augmented Zero-shot Image Captioning
Zequn Zeng
Yan Xie
Hao Zhang
Chiyu Chen
Zhengjue Wang
Boli Chen
VLM
86
15
0
06 Mar 2024
Causality-based Cross-Modal Representation Learning for
  Vision-and-Language Navigation
Causality-based Cross-Modal Representation Learning for Vision-and-Language Navigation
Liuyi Wang
Zongtao He
Ronghao Dang
Huiyi Chen
Chengju Liu
Qi Chen
94
1
0
06 Mar 2024
VEglue: Testing Visual Entailment Systems via Object-Aligned Joint
  Erasing
VEglue: Testing Visual Entailment Systems via Object-Aligned Joint Erasing
Zhiyuan Chang
Mingyang Li
Junjie Wang
Cheng Li
Qing Wang
58
0
0
05 Mar 2024
Zero-shot Generalizable Incremental Learning for Vision-Language Object
  Detection
Zero-shot Generalizable Incremental Learning for Vision-Language Object Detection
Jieren Deng
Haojian Zhang
Kun Ding
Jianhua Hu
Xingxuan Zhang
Yunkuan Wang
VLMObjD
167
7
0
04 Mar 2024
Navigating Hallucinations for Reasoning of Unintentional Activities
Navigating Hallucinations for Reasoning of Unintentional Activities
Shresth Grover
Vibhav Vineet
Yogesh S Rawat
LRM
83
1
0
29 Feb 2024
VIXEN: Visual Text Comparison Network for Image Difference Captioning
VIXEN: Visual Text Comparison Network for Image Difference Captioning
Alexander Black
Jing Shi
Yifei Fai
Tu Bui
John Collomosse
72
5
0
29 Feb 2024
Polos: Multimodal Metric Learning from Human Feedback for Image
  Captioning
Polos: Multimodal Metric Learning from Human Feedback for Image Captioning
Yuiga Wada
Kanta Kaneda
Daichi Saito
Komei Sugiura
89
30
0
28 Feb 2024
Measuring Vision-Language STEM Skills of Neural Models
Measuring Vision-Language STEM Skills of Neural Models
Jianhao Shen
Ye Yuan
Srbuhi Mirzoyan
Ming Zhang
Chenguang Wang
VLM
117
12
0
27 Feb 2024
Self-Supervised Interpretable End-to-End Learning via Latent Functional
  Modularity
Self-Supervised Interpretable End-to-End Learning via Latent Functional Modularity
Hyunki Seong
David Hyunchul Shim
61
1
0
21 Feb 2024
MORE-3S:Multimodal-based Offline Reinforcement Learning with Shared
  Semantic Spaces
MORE-3S:Multimodal-based Offline Reinforcement Learning with Shared Semantic Spaces
Tianyu Zheng
Ge Zhang
Xingwei Qu
Ming Kuang
Stephen W. Huang
Zhaofeng He
OffRL
121
1
0
20 Feb 2024
Cobra Effect in Reference-Free Image Captioning Metrics
Cobra Effect in Reference-Free Image Captioning Metrics
Zheng Ma
Changxin Wang
Yawen Ouyang
Fei Zhao
Jianbing Zhang
Shujian Huang
Jiajun Chen
90
2
0
18 Feb 2024
LLMs as Bridges: Reformulating Grounded Multimodal Named Entity
  Recognition
LLMs as Bridges: Reformulating Grounded Multimodal Named Entity Recognition
Jinyuan Li
Han Li
Di Sun
Jiahao Wang
Wenkun Zhang
Zan Wang
Gang Pan
101
7
0
15 Feb 2024
Multimodal Rationales for Explainable Visual Question Answering
Multimodal Rationales for Explainable Visual Question Answering
Kun Li
G. Vosselman
Michael Ying Yang
130
2
0
06 Feb 2024
Instruction Makes a Difference
Instruction Makes a Difference
Tosin Adewumi
Nudrat Habib
Lama Alkhaled
Elisa Barney
VLMMLLM
69
1
0
01 Feb 2024
Improving Data Augmentation for Robust Visual Question Answering with
  Effective Curriculum Learning
Improving Data Augmentation for Robust Visual Question Answering with Effective Curriculum Learning
Yuhang Zheng
Zhen Wang
Long Chen
59
2
0
28 Jan 2024
Multi-modal News Understanding with Professionally Labelled Videos
  (ReutersViLNews)
Multi-modal News Understanding with Professionally Labelled Videos (ReutersViLNews)
Shih-Han Chou
Matthew Kowal
Yasmin Niknam
Diana Moyano
Shayaan Mehdi
...
Cheng Zhang
Ian Knopke
S. Kocak
Leonid Sigal
Yalda Mohsenzadeh
137
1
0
23 Jan 2024
Inducing High Energy-Latency of Large Vision-Language Models with
  Verbose Images
Inducing High Energy-Latency of Large Vision-Language Models with Verbose Images
Kuofeng Gao
Yang Bai
Jindong Gu
Shu-Tao Xia
Philip Torr
Zhifeng Li
Wei Liu
VLM
86
47
0
20 Jan 2024
Advancing Large Multi-modal Models with Explicit Chain-of-Reasoning and
  Visual Question Generation
Advancing Large Multi-modal Models with Explicit Chain-of-Reasoning and Visual Question Generation
Kohei Uehara
Nabarun Goswami
Hanqin Wang
Toshiaki Baba
Kohtaro Tanaka
...
Takagi Naoya
Ryo Umagami
Yingyi Wen
Tanachai Anakewat
Tatsuya Harada
LRM
65
3
0
18 Jan 2024
KTVIC: A Vietnamese Image Captioning Dataset on the Life Domain
KTVIC: A Vietnamese Image Captioning Dataset on the Life Domain
Anh-Cuong Pham
Van-Quang Nguyen
Thi-Hong Vuong
Quang-Thuy Ha
53
1
0
16 Jan 2024
DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent)
DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent)
Zongxin Yang
Guikun Chen
Xiaodi Li
Wenguan Wang
Yi Yang
LM&RoLLMAG
177
41
0
16 Jan 2024
Uncovering the Full Potential of Visual Grounding Methods in VQA
Uncovering the Full Potential of Visual Grounding Methods in VQA
Daniel Reich
Tanja Schultz
102
5
0
15 Jan 2024
MapGPT: Map-Guided Prompting with Adaptive Path Planning for
  Vision-and-Language Navigation
MapGPT: Map-Guided Prompting with Adaptive Path Planning for Vision-and-Language Navigation
Jiaqi Chen
Bingqian Lin
Ran Xu
Zhenhua Chai
Xiaodan Liang
Kwan-Yee K. Wong
LM&RoLLMAG
82
32
0
14 Jan 2024
Previous
12345...363738
Next