Faster Transformer Decoding: N-gram Masked Self-Attention

14 January 2020
Ciprian Chelba
Mengzhao Chen
Ankur Bapna
Noam M. Shazeer
Abstract

Motivated by the fact that most of the information relevant to the prediction of target tokens is drawn from the source sentence S = s_1, \ldots, s_S, we propose truncating the target-side window used for computing self-attention by making an N-gram assumption. Experiments on WMT EnDe and EnFr data sets show that the N-gram masked self-attention model loses very little in BLEU score for N values in the range 4, \ldots, 8, depending on the task.
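The sketch below illustrates the masking idea from the abstract: under the N-gram assumption, each target position attends only to itself and the N-1 preceding target tokens, rather than the full causal prefix. This is a minimal NumPy illustration, not the authors' implementation; the function names (ngram_causal_mask, masked_self_attention) and the plain single-head softmax attention are assumptions made for clarity.

```python
# Minimal sketch of N-gram masked self-attention (illustrative, not the paper's code).
import numpy as np

def ngram_causal_mask(seq_len: int, n: int) -> np.ndarray:
    """Boolean mask: True where query position i may attend to key position j."""
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    causal = j <= i                   # never attend to future tokens
    local = j > i - n                 # only the last n tokens: j in (i-n, i]
    return causal & local

def masked_self_attention(q, k, v, n: int):
    """Single-head scaled dot-product attention restricted to an N-gram window."""
    seq_len, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    mask = ngram_causal_mask(seq_len, n)
    scores = np.where(mask, scores, -1e9)            # block disallowed positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v

# Example: with n=4, target position 10 attends only to positions 7..10.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(12, 16)).astype(np.float32)
    out = masked_self_attention(x, x, x, n=4)
    print(out.shape)  # (12, 16)
```

During incremental decoding, a window bounded by N means the decoder's self-attention only needs the most recent N target positions per layer rather than the whole generated prefix, which is the plausible source of the decoding speed-up the title refers to.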
