One Step of Gradient Descent is Provably the Optimal In-Context Learner with One Layer of Linear Self-Attention
Abstract
Recent works have empirically analyzed in-context learning and shown that transformers trained on synthetic linear regression tasks can learn to implement ridge regression, which is the Bayes-optimal predictor, given sufficient capacity [Akyürek et al., 2023], while one-layer transformers with linear self-attention and no MLP layer will learn to implement one step of gradient descent (GD) on a least-squares linear regression objective [von Oswald et al., 2022]. However, the theory behind these observations remains poorly understood. We theoretically study transformers with a single layer of linear self-attention, trained on synthetic noisy linear regression data. First, we mathematically show that when the covariates are drawn from a standard Gaussian distribution, the one-layer transformer which minimizes the pretraining loss will implement a single step of GD on the least-squares linear regression objective. Then, we find that changing the distribution of the covariates and weight vector to a non-isotropic Gaussian distribution has a strong impact on the learned algorithm: the global minimizer of the pretraining loss now implements a single step of $\textit{preconditioned}$ GD. However, if only the distribution of the responses is changed, then this does not have a large effect on the learned algorithm: even when the response comes from a more general family of $\textit{nonlinear}$ functions, the global minimizer of the pretraining loss still implements a single step of GD on a least-squares linear regression objective.
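The equivalence referenced above, from the construction of von Oswald et al. [2022], can be sketched numerically: one step of GD from $w = 0$ on the least-squares objective is reproduced by a single linear self-attention layer (no softmax) whose value matrix copies the response into the last token coordinate. The matrix names `WK`, `WQ`, `WV` and the specific parameter settings below are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 100
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))                # in-context covariates
y = X @ w_star + 0.1 * rng.normal(size=n)  # noisy linear responses
x_q = rng.normal(size=d)                   # query covariate

eta = 1.0 / n  # illustrative step size

# One GD step from w = 0 on (1/2) * sum_i (w @ x_i - y_i)^2:
# the gradient at w = 0 is -sum_i y_i x_i, so w_1 = eta * sum_i y_i x_i.
w1 = eta * (y @ X)
pred_gd = w1 @ x_q

# Linear self-attention on tokens e_i = [x_i; y_i] with query [x_q; 0].
# Keys/queries act on the x-part only; the value matrix writes y_i
# into the last coordinate, so the query's last coordinate accumulates
# sum_i y_i <x_i, x_q> -- exactly the one-step-GD prediction.
E = np.hstack([X, y[:, None]])             # (n, d+1) context tokens
e_q = np.concatenate([x_q, [0.0]])
WK = np.eye(d + 1); WK[d, d] = 0.0
WQ = np.eye(d + 1); WQ[d, d] = 0.0
WV = np.zeros((d + 1, d + 1)); WV[d, d] = 1.0
attn = (WV @ E.T) @ (E @ WK.T @ WQ @ e_q)  # linear attention (no softmax)
pred_attn = eta * attn[d]

print(np.allclose(pred_gd, pred_attn))     # the two predictions coincide
```

The match holds for any data and step size, since both expressions reduce to $\eta\, x_q^\top \sum_i y_i x_i$; the pretraining question studied in the paper is whether minimizing the loss actually selects such weights.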
 Publication:

arXiv e-prints
 Pub Date:
 July 2023
 DOI:
 10.48550/arXiv.2307.03576
 arXiv:
 arXiv:2307.03576
 Bibcode:
 2023arXiv230703576M
 Keywords:

 Computer Science - Machine Learning