Implicit Kernel Attention

AAAI Conference on Artificial Intelligence (AAAI), 2020

11 June 2020

Abstract

Attention computes the dependency between representations and encourages the model to focus on important, selected features. Among attention methods, scaled dot-product attention is widely used in many models. This paper proposes a generalized structure for scaled dot-product attention with similarity and magnitude terms. We show that scaled dot-product attention is a product of two parts: 1) an RBF kernel that measures the similarity of two instances and 2) an exponential $L^{2}$ norm that computes the importance of individual instances. From this decomposition, we improve attention in two ways: implicit modeling of the kernel spectral density and a generalized $L^{p}$ norm, which together result in a learnable and flexible attention structure. First, we estimate the spectral density of the kernel with implicit probabilistic models, which finds an appropriate kernel for a given dataset without manual kernel selection. Second, we introduce a generalized $L^p$ norm on the hidden feature space, where $p$ is a hyper-parameter that affects the scale of individual importance and the sparsity of attention weights. We also show how to extend this implicit kernel modeling to multi-head attention in conjunction with a copula augmentation. Our generalized attention shows better performance on text classification, translation, regression, and node classification tasks.
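
The decomposition described in the abstract can be checked numerically. The sketch below is not the paper's code; it relies only on the standard identity $q \cdot k = \tfrac{1}{2}\left(\|q\|^2 + \|k\|^2 - \|q - k\|^2\right)$, which splits the unnormalized score $\exp(q \cdot k / \sqrt{d})$ into an RBF similarity term and two exponential squared-norm magnitude terms. The function names, the sample vectors, and the choice of bandwidth $2\sqrt{d}$ are assumptions made for illustration, and the paper's own notation and normalization may differ.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attention_score(q, k):
    """Unnormalized scaled dot-product score exp(q.k / sqrt(d))."""
    d = len(q)
    return math.exp(dot(q, k) / math.sqrt(d))


def decomposed_score(q, k):
    """Same score written as (RBF similarity) * (magnitude terms).

    Assumption: bandwidth 2*sqrt(d), which is what the identity
    q.k = (|q|^2 + |k|^2 - |q - k|^2) / 2 yields directly.
    """
    d = len(q)
    scale = 2.0 * math.sqrt(d)
    diff = [a - b for a, b in zip(q, k)]
    similarity = math.exp(-dot(diff, diff) / scale)  # RBF kernel on (q, k)
    magnitude = math.exp(dot(q, q) / scale) * math.exp(dot(k, k) / scale)
    return similarity * magnitude


# Hypothetical query and key vectors, chosen only for the check.
q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.5, -0.5, 1.0]

# The two forms agree up to floating-point error.
assert math.isclose(attention_score(q, k), decomposed_score(q, k), rel_tol=1e-9)
```

Under this reading, the similarity factor depends only on the distance between query and key, while the magnitude factors depend on each vector separately. These are the two pieces the abstract proposes to generalize, through the learned kernel spectral density and the $L^{p}$ norm respectively.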

