PVTv2: Improved Baselines with Pyramid Vision Transformer

Abstract

Transformers in computer vision have recently shown encouraging progress. In this work, we improve the original Pyramid Vision Transformer (PVTv1) with three designs: (1) overlapping patch embedding, (2) convolutional feed-forward networks, and (3) linear-complexity attention layers. With these simple modifications, PVTv2 significantly improves on PVTv1 in classification, detection, and segmentation, and outperforms recent works such as Swin Transformer. We hope this work will make state-of-the-art vision Transformer research more accessible. Code is available at https://github.com/whai362/PVT .
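To illustrate the first design, here is a minimal NumPy sketch of overlapping patch embedding: patches are extracted with a stride smaller than the window size (with zero padding), so adjacent patches share pixels, unlike the non-overlapping tokenization in PVTv1. The window size 7, stride 4, and padding 3 are illustrative values, and the final linear projection to the embedding dimension is omitted.

```python
import numpy as np

def overlapping_patch_embed(x, patch_size=7, stride=4):
    """Extract overlapping patches from an (H, W, C) image.

    Because stride < patch_size, neighboring patches overlap,
    preserving local continuity around patch borders.
    """
    pad = patch_size // 2
    H, W, C = x.shape
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))  # zero-pad the borders
    out_h = (H + 2 * pad - patch_size) // stride + 1
    out_w = (W + 2 * pad - patch_size) // stride + 1
    patches = np.empty((out_h, out_w, patch_size * patch_size * C))
    for i in range(out_h):
        for j in range(out_w):
            win = xp[i * stride:i * stride + patch_size,
                     j * stride:j * stride + patch_size]
            patches[i, j] = win.ravel()
    # In the real model, a linear projection (equivalently, a strided
    # convolution) would map the last dimension to the embedding size.
    return patches

x = np.random.rand(56, 56, 3)
p = overlapping_patch_embed(x)
print(p.shape)  # (14, 14, 147): a 14x14 token grid of flattened 7x7x3 windows
```

In practice this whole operation is implemented as a single strided convolution, which is also why it composes naturally with the convolutional feed-forward networks of design (2).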
