Transformers, originally proposed for natural language processing (NLP)
tasks, have recently achieved great success in automatic speech recognition
(ASR). However, adjacent acoustic units (i.e., frames) are highly correlated,
and long-distance dependencies between them are weak, unlike text units. This
suggests that ASR will likely benefit from sparse and localized attention. In
this paper, we propose Weak-Attention Suppression (WAS), a method that
dynamically induces sparsity in attention probabilities. We demonstrate that
WAS leads to consistent Word Error Rate (WER) improvement over strong
transformer baselines. On the widely used LibriSpeech benchmark, our proposed
method reduces WER by 10% on test-clean and 5% on test-other for streamable
transformers, resulting in a new state-of-the-art among streaming models.
Further analysis shows that WAS learns to suppress attention on non-critical
and redundant continuous acoustic frames, and is more likely to suppress past
frames than future ones, indicating the importance of lookahead in
attention-based ASR models.
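
For intuition, the following is a minimal NumPy sketch of suppressing weak attention probabilities per query and renormalizing the survivors. The thresholding rule (mean minus gamma times standard deviation), the function name, and the gamma parameter are illustrative assumptions, not the exact formulation from the paper.

import numpy as np

def weak_attention_suppression(attn_probs, gamma=0.5):
    # attn_probs: (num_queries, num_keys), each row sums to 1.
    # Assumed per-query dynamic threshold: mean - gamma * std over keys.
    mean = attn_probs.mean(axis=-1, keepdims=True)
    std = attn_probs.std(axis=-1, keepdims=True)
    threshold = mean - gamma * std

    # Zero out attention weights below the per-query threshold.
    suppressed = np.where(attn_probs < threshold, 0.0, attn_probs)

    # Renormalize each row so the surviving probabilities sum to one.
    # At least one entry per row is >= the mean, so the sum stays positive.
    return suppressed / suppressed.sum(axis=-1, keepdims=True)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    probs = rng.dirichlet(np.ones(8), size=4)  # 4 queries over 8 frames
    print(weak_attention_suppression(probs, gamma=0.5).round(3))

Rows that already concentrate their mass on a few frames are left nearly unchanged, while diffuse rows have their weakest entries driven to zero, which is the sparsity effect the abstract describes.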