TransCenter: Transformers with Dense Representations for Multiple-Object Tracking

Transformers have shown superior performance on a wide variety of tasks since their introduction, and in recent years have drawn the attention of the vision community to efforts such as image classification and object detection. Despite this wave, building an accurate and efficient multiple-object tracking (MOT) method with transformers is not a trivial task. We argue that directly applying a transformer architecture, with its quadratic complexity and insufficient number of noise-initialized sparse queries, is not optimal for MOT. Inspired by recent research, we propose TransCenter, a transformer-based MOT architecture with dense representations for accurately tracking all objects while keeping a reasonable runtime. Methodologically, we propose the use of dense, image-related, multi-scale detection queries produced by an efficient transformer architecture. These queries allow inferring target locations globally and robustly from dense heatmap outputs. In parallel, a set of efficient sparse tracking queries interacts with image features in the TransCenter Decoder to associate object positions through time. TransCenter exhibits remarkable performance improvements and outperforms the current state-of-the-art by a large margin on two standard MOT benchmarks under two tracking (public/private) settings. The efficiency and accuracy of the proposed transformer architecture for MOT are validated through an extensive ablation study, demonstrating its advantages over more naive alternatives and concurrent works. The code will be made publicly available at https://github.com/yihongxu/transcenter.
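To make the two-branch idea concrete, below is a minimal PyTorch sketch of the abstract's core design: dense per-pixel detection queries scored into a center heatmap, alongside a small set of sparse tracking queries that cross-attend to image features. All class and parameter names (ToyTransCenterHead, feat_dim, num_track_queries) are hypothetical, and standard multi-head attention stands in for the paper's efficient attention; this is an illustration of the concept, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ToyTransCenterHead(nn.Module):
    """Toy sketch (not the paper's code): dense detection queries produce a
    center heatmap; sparse tracking queries attend to image features."""

    def __init__(self, feat_dim=64, num_track_queries=16):
        super().__init__()
        # Dense detection branch: every feature-map location acts as a query;
        # a 1x1 conv scores each location as an object center (heatmap logit).
        self.center_head = nn.Conv2d(feat_dim, 1, kernel_size=1)
        # Sparse tracking branch: learned queries cross-attend to image
        # features to carry object identities across frames. Plain
        # nn.MultiheadAttention is used here as a stand-in for the paper's
        # efficient attention mechanism.
        self.track_queries = nn.Parameter(torch.randn(num_track_queries, feat_dim))
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)

    def forward(self, feats):  # feats: (B, C, H, W) backbone features
        heatmap = self.center_head(feats).sigmoid()  # (B, 1, H, W) center scores
        B, C, H, W = feats.shape
        kv = feats.flatten(2).transpose(1, 2)        # (B, H*W, C) keys/values
        q = self.track_queries.unsqueeze(0).expand(B, -1, -1)
        track_feats, _ = self.cross_attn(q, kv, kv)  # (B, Nq, C) per-track features
        return heatmap, track_feats

# Smoke test on random features.
head = ToyTransCenterHead()
hm, tf = head(torch.randn(2, 64, 32, 32))
print(hm.shape, tf.shape)  # torch.Size([2, 1, 32, 32]) torch.Size([2, 16, 64])
```

In this sketch, object locations would be read off as local maxima of the heatmap, while the per-track features support association through time; the actual architecture in the paper is considerably more elaborate (multi-scale queries, an efficient encoder-decoder).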