Machine learning models, especially those based on deep learning, are used in everyday applications ranging from self-driving cars to medical diagnostics. However, such models are easily fooled by adversarial samples: inputs that are indistinguishable from real samples to the human eye, yet bias models toward incorrect classifications. The impact of adversarial samples is far-reaching, and their efficient detection remains an open problem. In this paper we propose to use direct density ratio estimation to detect adversarial samples, and we empirically show that adversarial samples have underlying probability densities that differ from those of real samples. Our proposed method works well on both color and grayscale images and across different adversarial sample generation methods.
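The abstract does not specify which direct density ratio estimator is used. Below is a minimal sketch of one standard choice, unconstrained least-squares importance fitting (uLSIF), adapted to this detection setting: fit the ratio r(x) = p_real(x) / p_test(x) and flag test inputs whose estimated ratio is unusually low. All names and parameters here (gaussian_kernel, ulsif_fit, sigma, lam, n_centers, the thresholding rule) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def gaussian_kernel(X, C, sigma):
    """Pairwise Gaussian kernel values between rows of X and centers C."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def ulsif_fit(x_nu, x_de, sigma=1.0, lam=1e-3, n_centers=100, seed=0):
    """Fit uLSIF weights alpha for r(x) ~ p_nu(x) / p_de(x).

    x_nu: samples from the numerator density (e.g. clean/real images, flattened).
    x_de: samples from the denominator density (e.g. the test stream).
    Solves alpha = (H + lam*I)^{-1} h, the closed-form uLSIF solution.
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(x_nu), size=min(n_centers, len(x_nu)), replace=False)
    C = x_nu[idx]                               # kernel centers from numerator samples
    Phi_nu = gaussian_kernel(x_nu, C, sigma)    # (n_nu, b) design matrix
    Phi_de = gaussian_kernel(x_de, C, sigma)    # (n_de, b)
    H = Phi_de.T @ Phi_de / len(x_de)           # (b, b) second moment under p_de
    h = Phi_nu.mean(axis=0)                     # (b,)  first moment under p_nu
    alpha = np.linalg.solve(H + lam * np.eye(len(C)), h)
    return C, alpha

def density_ratio(x, C, alpha, sigma=1.0):
    """Estimated ratio r(x), clipped at 0 as is standard for uLSIF."""
    return np.maximum(0.0, gaussian_kernel(np.atleast_2d(x), C, sigma) @ alpha)

# Hypothetical usage: low r(x) suggests x is unlikely under the real-data
# density relative to the test density, i.e. a candidate adversarial sample.
# C, alpha = ulsif_fit(clean_images, test_images)
# is_adversarial = density_ratio(test_images, C, alpha) < threshold
```

In practice sigma and lam would be chosen by cross-validation, and the threshold calibrated on held-out clean data; the paper's actual estimator and decision rule may differ.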