BERTRAM: Improved Word Embeddings Have Big Impact on Contextualized Model Performance

Pretraining deep contextualized representations using an unsupervised language modeling objective has led to large performance gains for a variety of NLP tasks. Despite this success, recent work by Schick and Schütze (2019) suggests that these architectures struggle to understand rare words. For context-independent word embeddings, this problem can be addressed by separately learning representations for infrequent words. In this work, we show that the same idea can also be applied to contextualized models and clearly improves their downstream task performance. Most approaches for inducing word embeddings into existing embedding spaces are based on simple bag-of-words models; hence they are not a suitable counterpart for deep neural network language models. To overcome this problem, we introduce BERTRAM, a powerful architecture based on a pretrained BERT language model and capable of inferring high-quality representations for rare words. In BERTRAM, surface form and contexts of a word directly interact with each other in a deep architecture. Both on a rare word probing task and on three downstream task datasets, BERTRAM considerably improves representations for rare and medium-frequency words compared to both a standalone BERT model and previous work.
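The abstract does not spell out the architecture, but the core idea, inferring an embedding for a rare word from its surface form together with a handful of BERT-encoded contexts, can be illustrated with a minimal PyTorch sketch. Everything below is an assumption for illustration only: the class names `FormEncoder` and `RareWordEmbedder`, the character n-gram bag, the averaging of `[CLS]` states, and the linear fusion layer are hypothetical and simpler than BERTRAM itself, in which form and contexts interact directly inside the deep architecture rather than being fused after the fact.

```python
# Minimal sketch (not the paper's implementation): build a surface-form
# embedding from character n-grams, encode a few context sentences with a
# pretrained BERT, and fuse the two signals into one vector for a rare word.
from typing import List

import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer


class FormEncoder(nn.Module):
    """Bag of character n-gram embeddings for the word's surface form (hypothetical)."""

    def __init__(self, ngram_vocab_size: int, dim: int):
        super().__init__()
        self.ngrams = nn.Embedding(ngram_vocab_size, dim)

    def forward(self, ngram_ids: torch.LongTensor) -> torch.Tensor:
        # Average the n-gram vectors into a single form embedding of shape (dim,).
        return self.ngrams(ngram_ids).mean(dim=0)


class RareWordEmbedder(nn.Module):
    """Infer an embedding for a rare word from its form and a few context sentences."""

    def __init__(self, ngram_vocab_size: int, bert_name: str = "bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        self.tokenizer = BertTokenizer.from_pretrained(bert_name)
        dim = self.bert.config.hidden_size
        self.form = FormEncoder(ngram_vocab_size, dim)
        self.fuse = nn.Linear(2 * dim, dim)  # shallow fusion; BERTRAM fuses more deeply

    def forward(self, ngram_ids: torch.LongTensor, contexts: List[str]) -> torch.Tensor:
        form_vec = self.form(ngram_ids)  # surface-form signal
        enc = self.tokenizer(contexts, return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            hidden = self.bert(**enc).last_hidden_state  # (n_contexts, seq_len, dim)
        context_vec = hidden[:, 0].mean(dim=0)  # average [CLS] states over contexts
        return self.fuse(torch.cat([form_vec, context_vec]))  # fused rare-word embedding
```

In a setup like this, the resulting vector would be trained to mimic the embedding of a frequent word from its form and contexts, then injected into the embedding space of the downstream model for words that are too rare to have reliable representations of their own.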