Improving Image Captioning by Leveraging Intra- and Inter-layer Global Representation in Transformer Network

13 December 2020

Jiayi Ji

Yongjian Wu

Papers citing "Improving Image Captioning by Leveraging Intra- and Inter-layer Global Representation in Transformer Network"

20 / 20 papers shown

Title
Tri-FusionNet: Enhancing Image Description Generation with Transformer-based Fusion Network and Dual Attention Mechanism Lakshita Agarwal Bindu Verma ViT 27 0 0 23 Apr 2025
Multi-Granular Multimodal Clue Fusion for Meme Understanding Li Zheng Hao Fei Ting Dai Zuquan Peng Fei Li Huisheng Ma Chong Teng Donghong Ji 60 0 0 16 Mar 2025
Constructing Holistic Spatio-Temporal Scene Graph for Video Semantic Role Labeling Yu Zhao Hao Fei Yixin Cao Bobo Li Meishan Zhang Jianguo Wei M. Zhang Tat-Seng Chua 17 13 0 09 Aug 2023
HAAV: Hierarchical Aggregation of Augmented Views for Image Captioning Chia-Wen Kuo Z. Kira 31 21 0 25 May 2023
A request for clarity over the End of Sequence token in the Self-Critical Sequence Training J. Hu Roberto Cavicchioli Alessandro Capotondi 26 6 0 20 May 2023
$SnakeVoxFormer: Transformer-based Single Image\\Voxel Reconstruction with Run Length Encoding$ SnakeVoxFormer: Transformer-based Single Image\\Voxel Reconstruction with Run Length Encoding Jae Joong Lee Bedrich Benes ViT 24 0 0 28 Mar 2023
Towards Real-Time Panoptic Narrative Grounding by an End-to-End Grounding Network Haowei Wang Jiayi Ji Yiyi Zhou Yongjian Wu Xiaoshuai Sun 30 15 0 09 Jan 2023
Controllable Image Captioning via Prompting Ning Wang Jiahao Xie Jihao Wu Mingbo Jia Linlin Li 19 23 0 04 Dec 2022
How to Describe Images in a More Funny Way? Towards a Modular Approach to Cross-Modal Sarcasm Generation Jie Ruan Yue Wu Xiaojun Wan Yuesheng Zhu 29 1 0 20 Nov 2022
Progressive Tree-Structured Prototype Network for End-to-End Image Captioning Pengpeng Zeng Jinkuan Zhu Jingkuan Song Lianli Gao VLM 22 27 0 17 Nov 2022
Show, Interpret and Tell: Entity-aware Contextualised Image Captioning in Wikipedia K. Nguyen Ali Furkan Biten Andrés Mafla Lluís Gómez Dimosthenis Karatzas 33 10 0 21 Sep 2022
GSRFormer: Grounded Situation Recognition Transformer with Alternate Semantic Attention Refinement Zhi-Qi Cheng Qianwen Dai Siyao Li Teruko Mitamura Alexander G. Hauptmann 16 34 0 18 Aug 2022
GRIT: Faster and Better Image captioning Transformer Using Dual Visual Features Van-Quang Nguyen Masanori Suganuma Takayuki Okatani ViT 30 106 0 20 Jul 2022
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval Yiwei Ma Guohai Xu Xiaoshuai Sun Ming Yan Ji Zhang Rongrong Ji CLIP VLM 22 269 0 15 Jul 2022
Enabling Harmonious Human-Machine Interaction with Visual-Context Augmented Dialogue System: A Review Hao Wang Bin Guo Y. Zeng Yasan Ding Chen Qiu Ying Zhang Li Yao Zhiwen Yu 30 2 0 02 Jul 2022
End-to-End Transformer Based Model for Image Captioning Yiyu Wang Jungang Xu Yingfei Sun VLM ViT 26 117 0 29 Mar 2022
ZeroCap: Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic Yoad Tewel Yoav Shalev Idan Schwartz Lior Wolf VLM 34 192 0 29 Nov 2021
Self-Annotated Training for Controllable Image Captioning Zhangzi Zhu Tianlei Wang Hong Qu 27 2 0 16 Oct 2021
From Show to Tell: A Survey on Deep Learning-based Image Captioning Matteo Stefanini Marcella Cornia Lorenzo Baraldi S. Cascianelli G. Fiameni Rita Cucchiara 3DV VLM MLLM 67 254 0 14 Jul 2021
Normalized and Geometry-Aware Self-Attention Network for Image Captioning Longteng Guo Jing Liu Xinxin Zhu Peng Yao Shichen Lu Hanqing Lu ViT 120 189 0 19 Mar 2020