Catch the "Tails" of BERT
Recently, contextualized word embeddings have outperformed static word embeddings on many NLP tasks. However, we still know little about the mechanisms inside these representations. Do they share any common patterns? If so, where do these patterns come from? We find that almost all the contextualized word vectors of BERT and RoBERTa share a common pattern: for BERT, one particular element is always the smallest; for RoBERTa, one element is always the largest and another is always the smallest. We call these elements the "tails" of the models. We introduce a new neuron-level method to analyze where these "tails" come from, and find that they are closely related to positional information. We also investigate what happens if we "cut the tails" (zero them out). Our results show that the "tails" are the major cause of the anisotropy of the vector space. After "cutting the tails", a word's different vectors become more similar to each other. The internal representations are better at distinguishing a word's different senses on the word-in-context (WiC) dataset. Performance on the word sense disambiguation task improves for BERT and is unchanged for RoBERTa. We can also better induce phrase grammar from the vector space. These results suggest that the "tails" carry little of the sense and syntax information in the vectors. These findings provide insights into the inner workings of contextualized word vectors.
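To make the "cutting the tails" (zero-out) operation concrete, here is a minimal sketch using the HuggingFace transformers library. The checkpoint name and the OUTLIER_DIM index are placeholder assumptions, not the dimension identified in the paper; the anisotropy proxy (average pairwise cosine similarity) is a common choice, not necessarily the paper's exact measure.

```python
# Minimal sketch: zero out one "tail" dimension and compare an anisotropy proxy
# before and after. OUTLIER_DIM is a hypothetical placeholder index.
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "bert-base-uncased"   # assumption: any BERT/RoBERTa checkpoint works similarly
OUTLIER_DIM = 0                    # hypothetical index of the "tail" dimension

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

sentences = ["The bank raised interest rates.", "She sat on the river bank."]
with torch.no_grad():
    enc = tokenizer(sentences, return_tensors="pt", padding=True)
    hidden = model(**enc).last_hidden_state            # (batch, seq_len, hidden_size)

# Keep only non-padding token vectors.
mask = enc["attention_mask"].bool()
vecs = hidden[mask]                                     # (num_tokens, hidden_size)

def mean_cosine(v: torch.Tensor) -> float:
    """Average pairwise cosine similarity: a simple anisotropy proxy
    (higher means the vectors crowd into a narrower cone)."""
    v = torch.nn.functional.normalize(v, dim=-1)
    sims = v @ v.T
    n = v.size(0)
    off_diag = sims.sum() - sims.diag().sum()
    return (off_diag / (n * (n - 1))).item()

print("anisotropy before cut:", mean_cosine(vecs))

# "Cut the tail": zero out the outlier dimension and re-measure.
cut = vecs.clone()
cut[:, OUTLIER_DIM] = 0.0
print("anisotropy after cut: ", mean_cosine(cut))
```

If the zeroed dimension is indeed a "tail", the second similarity should drop noticeably, reflecting a less anisotropic space.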