DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection

Abstract

We present DINO (\textbf{D}ETR with \textbf{I}mproved de\textbf{N}oising anch\textbf{O}r boxes), a state-of-the-art end-to-end object detector. DINO improves over previous DETR-like models in performance and efficiency by using a contrastive way for denoising training, a mixed query selection method for anchor initialization, and a look forward twice scheme for box prediction. DINO achieves 48.3AP in 12 epochs and 51.0AP in 36 epochs on COCO with a ResNet-50 backbone and multi-scale features, yielding a significant improvement of \textbf{+4.9AP} and \textbf{+2.4AP}, respectively, compared to DN-DETR, the previous best DETR-like model. DINO scales well in both model size and data size. Without bells and whistles, after pre-training on the Objects365 dataset with a SwinL backbone, DINO obtains the best results on both COCO \texttt{val2017} (\textbf{63.2AP}) and \texttt{test-dev} (\textbf{63.3AP}). Compared to other models on the leaderboard, DINO significantly reduces its model size and pre-training data size while achieving better results. Our code will be available at \url{https://github.com/IDEACVR/DINO}.
