NODE Transformer: A Depth-Adaptive Variant of the Transformer Using Neural Ordinary Differential Equations
Abstract
We use neural ordinary differential equations (ODEs) to formulate a variant of the Transformer that is depth-adaptive in the sense that the ODE solver takes an input-dependent number of time steps. Our goal in proposing the NODE Transformer is to investigate whether its depth-adaptivity can help overcome specific known theoretical limitations of the Transformer in handling non-local effects. Specifically, we consider the simple problem of determining the parity of a binary sequence, for which the standard Transformer has known limitations that can be overcome only by using a sufficiently large number of layers or attention heads. We find, however, that the depth-adaptivity of the NODE Transformer does not remedy the inherently non-local nature of the parity problem, and we explain why this is so. We then pursue regularization of the NODE Transformer by penalizing the arc length of its ODE trajectories, but find that this fails to improve either the accuracy or the efficiency of the model on the challenging parity problem. Finally, we suggest future avenues of research for modifications and extensions of the NODE Transformer that may improve accuracy and efficiency on sequence-modelling tasks such as neural machine translation.
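The two ingredients the abstract combines can be sketched in a few lines. The sketch below is not the authors' implementation: it uses a toy linear ODE and SciPy's adaptive RK45 solver purely to show what "depth-adaptive" means here, namely that the solver itself chooses how many function evaluations (effective layers) to spend on an input, and it states the parity task that the paper uses as a stress test. The `stiffness` parameter and the `ode_block` helper are illustrative assumptions, not quantities from the paper.

```python
import numpy as np
from scipy.integrate import solve_ivp

def parity(bits):
    """The paper's test task: parity of a binary sequence
    (1 if the number of ones is odd, else 0)."""
    return sum(bits) % 2

def ode_block(h0, stiffness):
    """Toy 'continuous-depth' layer: integrate dh/dt = -stiffness * h
    from t=0 to t=1 with an adaptive solver.  The solver picks its own
    step sizes, so harder (stiffer) inputs consume more function
    evaluations -- the input-dependent depth the abstract describes.
    (Illustrative stand-in for a Transformer layer, not the real model.)"""
    sol = solve_ivp(lambda t, h: -stiffness * h, (0.0, 1.0), h0,
                    method="RK45", rtol=1e-6, atol=1e-8)
    return sol.y[:, -1], sol.nfev  # final state, number of evaluations

# An "easy" and a "hard" input drive different effective depths:
_, steps_easy = ode_block(np.ones(4), stiffness=1.0)
_, steps_hard = ode_block(np.ones(4), stiffness=50.0)
```

Because parity of an n-bit string depends on every bit, no fixed local computation suffices; the paper's negative result is that letting the solver add steps does not, by itself, supply the missing non-local computation.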
Publication: arXiv e-prints
Pub Date: October 2020
DOI: 10.48550/arXiv.2010.11358
arXiv: arXiv:2010.11358
Bibcode: 2020arXiv201011358B
Keywords: Computer Science - Machine Learning; Computer Science - Computation and Language