46
25

Semi-Autoregressive Transformer for Image Captioning

Yuanen Zhou
Yong Zhang
Zhenzhen Hu
Meng Wang
Abstract

Current state-of-the-art image captioning models adopt autoregressive decoders, \ie they generate each word by conditioning on previously generated words, which leads to heavy latency during inference. To tackle this issue, non-autoregressive image captioning models have recently been proposed to significantly accelerate the speed of inference by generating all words in parallel. However, these non-autoregressive models inevitably suffer from large generation quality degradation since they remove words dependence excessively. To make a better trade-off between speed and quality, we introduce a semi-autoregressive model for image captioning~(dubbed as SATIC), which keeps the autoregressive property in global but generates words parallelly in local . Based on Transformer, there are only a few modifications needed to implement SATIC. Experimental results on the MSCOCO image captioning benchmark show that SATIC can achieve a good trade-off without bells and whistles. Code is available at {\color{magenta}\url{https://github.com/YuanEZhou/satic}}.

View on arXiv
Comments on this paper

We use cookies and other tracking technologies to improve your browsing experience on our website, to show you personalized content and targeted ads, to analyze our website traffic, and to understand where our visitors are coming from. See our policy.