
PVTv2: Improved Baselines with Pyramid Vision Transformer

Abstract

Transformers have recently shown encouraging progress in computer vision. In this work, we present new baselines by improving the original Pyramid Vision Transformer (PVTv1) with three designs: (1) a linear-complexity attention layer, (2) overlapping patch embedding, and (3) a convolutional feed-forward network. With these modifications, PVTv2 reduces the computational complexity of PVTv1 to linear and achieves significant improvements on fundamental vision tasks such as classification, detection, and segmentation. Notably, the proposed PVTv2 achieves comparable or better performance than recent works such as Swin Transformer. We hope this work will facilitate state-of-the-art Transformer research in computer vision. Code is available at https://github.com/whai362/PVT.
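To make the three designs named in the abstract concrete, below is a minimal PyTorch sketch of each: an attention layer whose keys/values are pooled to a fixed spatial size (linear cost in the number of tokens), a patch embedding whose convolution stride is smaller than its kernel (overlapping patches), and a feed-forward network with a depthwise 3x3 convolution between its linear layers. All layer sizes (pool size 7, a 7x7/stride-4 patch convolution, hidden dimension 256) are illustrative assumptions for this sketch, not the exact configuration of the paper; consult the official repository for that.

```python
# Hedged sketch of the three PVTv2 components; hyperparameters are assumptions.
import torch
import torch.nn as nn


class OverlappingPatchEmbed(nn.Module):
    """Patch embedding whose stride is smaller than its kernel, so
    neighbouring patches overlap and local continuity is preserved."""

    def __init__(self, in_chans=3, embed_dim=64, patch_size=7, stride=4):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size,
                              stride=stride, padding=patch_size // 2)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                      # x: (B, C, H, W)
        x = self.proj(x)                       # (B, D, H', W')
        B, D, H, W = x.shape
        x = x.flatten(2).transpose(1, 2)       # (B, H'*W', D) token sequence
        return self.norm(x), (H, W)


class LinearSRAttention(nn.Module):
    """Attention whose keys/values are average-pooled to a fixed spatial
    size, so the cost grows linearly with the number of input tokens."""

    def __init__(self, dim, num_heads=1, pool_size=7):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.pool = nn.AdaptiveAvgPool2d(pool_size)   # fixed-size K/V grid
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, hw):
        H, W = hw
        B, N, D = x.shape
        kv = x.transpose(1, 2).reshape(B, D, H, W)
        kv = self.pool(kv).flatten(2).transpose(1, 2)  # (B, pool^2, D)
        kv = self.norm(kv)
        out, _ = self.attn(x, kv, kv)          # queries keep full resolution
        return out


class ConvFeedForward(nn.Module):
    """MLP with a 3x3 depthwise convolution between the two linear layers,
    injecting local positional information into the token sequence."""

    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.dwconv = nn.Conv2d(hidden_dim, hidden_dim, 3, padding=1,
                                groups=hidden_dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x, hw):
        H, W = hw
        B, N, D = x.shape
        x = self.fc1(x)
        x = x.transpose(1, 2).reshape(B, -1, H, W)
        x = self.dwconv(x).flatten(2).transpose(1, 2)
        return self.fc2(self.act(x))


if __name__ == "__main__":
    img = torch.randn(2, 3, 224, 224)
    tokens, hw = OverlappingPatchEmbed()(img)          # 56*56 = 3136 tokens
    tokens = tokens + LinearSRAttention(64)(tokens, hw)
    tokens = tokens + ConvFeedForward(64, 256)(tokens, hw)
    print(tokens.shape)                                # torch.Size([2, 3136, 64])
```

In this sketch, one transformer block (attention plus feed-forward, each with a residual connection) is applied to the overlapping patch embedding of a 224x224 image; the attention cost stays linear because the key/value set is always pooled to a 7x7 grid regardless of input resolution.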
