On the Last Iterate Convergence of Momentum Methods
Abstract
SGD with Momentum (SGDM) is a widely used family of algorithms for large-scale optimization of machine learning problems. Yet, when optimizing generic convex functions, no advantage is known for any SGDM algorithm over plain SGD. Moreover, even the most recent results require changes to the SGDM algorithms, like averaging of the iterates and a projection onto a bounded domain, which are rarely used in practice. In this paper, we focus on the convergence rate of the last iterate of SGDM. For the first time, we prove that for any constant momentum factor, there exists a Lipschitz and convex function for which the last iterate of SGDM suffers from a suboptimal convergence rate of $\Omega(\frac{\ln T}{\sqrt{T}})$ after $T$ iterations. Based on this fact, we study a class of (both adaptive and non-adaptive) Follow-The-Regularized-Leader-based SGDM algorithms with increasing momentum and shrinking updates. For these algorithms, we show that the last iterate has optimal convergence $O(\frac{1}{\sqrt{T}})$ for unconstrained convex stochastic optimization problems, without projections onto bounded domains or knowledge of $T$. Further, we show a variety of results for FTRL-based SGDM when used with adaptive stepsizes. Empirical results are shown as well.
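For readers unfamiliar with the baseline algorithm the abstract analyzes, here is a minimal sketch of plain SGD with (heavy-ball) momentum on a convex Lipschitz objective. The function names, the constant momentum factor `beta`, and the $\eta/\sqrt{t}$ stepsize schedule are illustrative assumptions for a standard convex setting; this is not the paper's FTRL-based variant with increasing momentum and shrinking updates.

```python
import numpy as np

def sgdm(grad, x0, T, eta=0.1, beta=0.9, rng=None):
    """Plain SGD with momentum (illustrative, not the paper's FTRL variant).

    grad(x, rng): returns a stochastic (sub)gradient at x.
    Returns the last iterate x_T, the quantity whose rate the paper studies.
    """
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)
    for t in range(1, T + 1):
        g = grad(x, rng)
        m = beta * m + g                  # momentum buffer with constant factor beta
        x = x - (eta / np.sqrt(t)) * m    # standard 1/sqrt(t) stepsize for convex Lipschitz f
    return x

# Toy usage: noisy subgradients of f(x) = |x| (convex, 1-Lipschitz)
g = lambda x, rng: np.sign(x) + 0.1 * rng.standard_normal(x.shape)
x_T = sgdm(g, np.array([5.0]), T=2000)
```

The paper's lower bound says that, for some convex Lipschitz function, the last iterate of such constant-momentum schemes converges no faster than $\Omega(\frac{\ln T}{\sqrt{T}})$, whereas their FTRL-based variants recover the optimal $O(\frac{1}{\sqrt{T}})$ rate.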
 Publication:

arXiv e-prints
 Pub Date:
 February 2021
 DOI:
 10.48550/arXiv.2102.07002
 arXiv:
 arXiv:2102.07002
 Bibcode:
 2021arXiv210207002L
 Keywords:

 Computer Science - Machine Learning;
 Mathematics - Optimization and Control;
 Statistics - Machine Learning
 E-Print:
 Differences with ALT'22 camera-ready: clarified the statement of the lower bound