How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers

The attention mechanism is considered the backbone of the widely used Transformer architecture. It contextualizes the input by computing input-specific attention matrices. We find that this mechanism, while powerful and elegant, is not as important as typically thought for pretrained language models. We introduce PAPA, a new probing method that replaces the input-dependent attention matrices with constant ones: the average attention weights over multiple inputs. We use PAPA to analyze several established pretrained Transformers on six downstream tasks. We find that without any input-dependent attention, all models achieve competitive performance, with an average relative drop of only 8% from the probing baseline. Further, little or no performance drop is observed when replacing half of the input-dependent attention matrices with constant (input-independent) ones. Interestingly, we show that better-performing models lose more from applying our method than weaker models, suggesting that the utilization of the input-dependent attention mechanism might be a factor in their success. Our results motivate research on simpler alternatives to input-dependent attention, as well as on methods for better utilization of this mechanism in the Transformer architecture.
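The core idea described in the abstract, replacing input-dependent attention matrices with their average over multiple inputs, can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a single attention head, a fixed sequence length, and random vectors standing in for real queries, keys, and values. All function names (`attention_matrix`, `average_attention`, `apply_attention`) are hypothetical, and the abstract does not specify details such as how sequences of different lengths are handled.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_matrix(queries, keys):
    """Input-dependent attention for one head: softmax(Q K^T / sqrt(d))."""
    d = len(queries[0])
    scores = [
        [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        for q in queries
    ]
    return [softmax(r) for r in scores]


def average_attention(matrices):
    """Element-wise mean of equally sized attention matrices."""
    n = len(matrices)
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [
        [sum(m[i][j] for m in matrices) / n for j in range(cols)]
        for i in range(rows)
    ]


def apply_attention(attn, values):
    """Mix the value vectors using the given attention weights."""
    dim = len(values[0])
    return [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(dim)]
        for row in attn
    ]


def random_vectors(rng, seq_len, dim):
    """Toy stand-in for projected token representations (an assumption)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(seq_len)]


rng = random.Random(0)
SEQ_LEN, DIM = 4, 8

# Step 1: collect input-dependent attention matrices over many toy inputs.
sample_matrices = [
    attention_matrix(
        random_vectors(rng, SEQ_LEN, DIM), random_vectors(rng, SEQ_LEN, DIM)
    )
    for _ in range(100)
]

# Step 2: average them into one constant, input-independent matrix.
constant_attn = average_attention(sample_matrices)

# Step 3: for a new input, reuse the constant matrix instead of
# computing attention from that input's own queries and keys.
new_values = random_vectors(rng, SEQ_LEN, DIM)
output = apply_attention(constant_attn, new_values)
```

Because each sampled matrix has rows that sum to one, their average does too, so the constant matrix is still a valid set of mixing weights. Only the value vectors depend on the new input in step 3.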

Publication:

arXiv e-prints

Pub Date:

November 2022

DOI:

10.48550/arXiv.2211.03495

arXiv:

arXiv:2211.03495

Bibcode:

2022arXiv221103495H

Keywords:

Computer Science - Computation and Language;
Computer Science - Machine Learning

E-Print:

Findings of EMNLP 2022
