Multi-Scale Self-Attention for Text Classification

doi:10.48550/arXiv.1912.00544

In this paper, we introduce multi-scale structure, a form of prior knowledge, into self-attention modules. We propose a Multi-Scale Transformer, which uses multi-scale multi-head self-attention to capture features at different scales. Drawing on a linguistic perspective and an analysis of a Transformer pre-trained on a huge corpus (BERT), we further design a strategy to control the scale distribution for each layer. Results on three kinds of tasks (21 datasets) show that our Multi-Scale Transformer outperforms the standard Transformer consistently and significantly on small and moderate-size datasets.
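The idea of giving each attention head its own scale can be illustrated with a minimal sketch. It assumes that a head's "scale" is a local window size, so that each position attends only to tokens within that many positions on either side. The function names, the windowing rule, and the omission of learned query/key/value projections are illustrative assumptions, not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def windowed_attention(queries, keys, values, scale):
    """Scaled dot-product attention restricted to a local window.

    Assumption: position i sees positions i - scale .. i + scale,
    clipped to the sequence boundaries.
    """
    dim = len(queries[0])
    outputs = []
    for i, query in enumerate(queries):
        window = range(max(0, i - scale), min(len(keys), i + scale + 1))
        scores = [
            sum(a * b for a, b in zip(query, keys[j])) / math.sqrt(dim)
            for j in window
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors inside the window.
        outputs.append([
            sum(w * values[j][d] for w, j in zip(weights, window))
            for d in range(len(values[0]))
        ])
    return outputs


def multi_scale_attention(hidden, scales):
    """Run one head per scale and concatenate the head outputs per position.

    Learned projections are omitted: each head uses `hidden` as Q, K and V.
    """
    heads = [windowed_attention(hidden, hidden, hidden, s) for s in scales]
    return [sum((head[i] for head in heads), []) for i in range(len(hidden))]


# Toy input: 4 tokens with 2 features each (made-up values).
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
mixed = multi_scale_attention(hidden, scales=[0, 1, 3])
```

In this sketch a scale of 0 lets a token see only itself, while a scale at least as large as the sequence length reduces to ordinary global self-attention. Controlling the scale distribution per layer then amounts to choosing a different `scales` list for each layer.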

Publication: arXiv e-prints
Pub Date: December 2019
DOI: 10.48550/arXiv.1912.00544
arXiv: arXiv:1912.00544
Bibcode: 2019arXiv191200544G
Keywords: Computer Science - Computation and Language; Computer Science - Machine Learning
E-Print: Accepted at AAAI 2020
