
Local Monotonic Attention Mechanism for End-to-End Speech Recognition

Abstract

Recently, sequence-to-sequence models based on encoder-decoder neural networks have gained popularity for automatic speech recognition (ASR). The architecture commonly uses an attentional mechanism that allows the model to learn alignments between the source speech sequence and the target text sequence. Most attentional mechanisms in use today are based on a global attention property, which requires computing a weighted summarization of the whole input sequence generated by the encoder states. However, this is computationally expensive and often produces misalignment on longer input sequences. Furthermore, it does not fit the monotonic, left-to-right nature of the speech recognition task. In this paper, we propose a novel attention mechanism that has local and monotonic properties. Various ways to control those properties are also explored. Experimental results demonstrate that encoder-decoder based ASR with local monotonic attention can achieve significant performance improvements and reduce computational complexity compared with the standard global attention architecture.
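
To make the contrast between global and local monotonic attention concrete, the following is a minimal sketch and not the formulation used in the paper. It assumes precomputed alignment scores for one decoder step, and the names `local_monotonic_attention`, `prev_center`, `step`, and `half_width` are hypothetical, introduced only for illustration. Global attention normalizes over every encoder position, while the local variant normalizes only inside a small window whose center is never allowed to move backward.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def global_attention(scores):
    """Weights over all T encoder positions (cost grows with T per decoder step)."""
    return softmax(scores)


def local_monotonic_attention(scores, prev_center, step, half_width):
    """Weights restricted to a window around a forward-only center.

    Assumptions of this sketch (not taken from the paper):
    - `step` is a proposed forward movement; negative values are clipped to 0,
      which enforces the monotonic (left-to-right) property.
    - `half_width` fixes the window size, which enforces the local property.
    """
    T = len(scores)
    center = min(prev_center + max(step, 0), T - 1)
    lo = max(0, center - half_width)
    hi = min(T, center + half_width + 1)
    weights = [0.0] * T
    # Normalize only over the window; positions outside it get zero weight.
    for i, w in zip(range(lo, hi), softmax(scores[lo:hi])):
        weights[i] = w
    return center, weights


scores = [0.1, 0.5, 0.2, 0.9, 0.3, 0.7]
center, local_w = local_monotonic_attention(scores, prev_center=1, step=2, half_width=1)
global_w = global_attention(scores)
```

In this sketch the per-step cost of the local variant depends on the window size rather than on the full input length, which is the kind of saving the abstract refers to. How the center and window are actually predicted and controlled is described in the paper itself.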
