Adaptive Gradient Methods Converge Faster with Over-Parameterization (but you should do a line-search)
Abstract
Adaptive gradient methods are typically used for training over-parameterized models. To better understand their behaviour, we study a simplistic setting: smooth, convex losses with models over-parameterized enough to interpolate the data. In this setting, we prove that AMSGrad with a constant stepsize and momentum converges to the minimizer at the faster $O(1/T)$ rate. When interpolation is only approximately satisfied, constant-stepsize AMSGrad converges to a neighbourhood of the solution at the same rate, while AdaGrad is robust to the violation of interpolation. However, even for simple convex problems satisfying interpolation, the empirical performance of both methods depends heavily on the stepsize and requires tuning, questioning their adaptivity. We alleviate this problem by automatically determining the stepsize using stochastic line-search or Polyak stepsizes. With these techniques, we prove that both AdaGrad and AMSGrad retain their convergence guarantees, without needing to know problem-dependent constants. Empirically, we demonstrate that these techniques improve the convergence and generalization of adaptive gradient methods across tasks, from binary classification with kernel mappings to multi-class classification with deep networks.
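To make the stepsize ideas concrete, here is a minimal sketch (not the paper's exact algorithm) of plain SGD with the stochastic Polyak stepsize on an interpolating least-squares problem, where every per-sample loss can simultaneously reach zero; the problem dimensions and iteration count are illustrative choices, not from the paper.

```python
import numpy as np

# Over-parameterized least squares: d > n, noiseless targets, so a single
# parameter vector interpolates all samples and each per-sample optimum is 0.
rng = np.random.default_rng(0)
n, d = 20, 50
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
b = A @ x_true                       # noiseless labels -> interpolation holds

x = np.zeros(d)
for t in range(10000):
    i = rng.integers(n)              # sample one data point
    r = A[i] @ x - b[i]              # residual of sample i
    loss_i = 0.5 * r**2              # per-sample loss, optimal value f_i^* = 0
    g = r * A[i]                     # stochastic gradient of loss_i
    gnorm2 = g @ g
    if gnorm2 > 0:
        # Stochastic Polyak stepsize: (f_i(x) - f_i^*) / ||grad f_i(x)||^2.
        # No problem-dependent constants (e.g. smoothness) are needed.
        eta = loss_i / gnorm2
        x -= eta * g

final_loss = 0.5 * np.mean((A @ x - b) ** 2)
```

Under interpolation, each update shrinks the chosen sample's residual without any tuned stepsize, which is the property the paper exploits when combining such stepsizes with AdaGrad and AMSGrad.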
Publication: arXiv e-prints
Pub Date: June 2020
DOI: 10.48550/arXiv.2006.06835
arXiv: arXiv:2006.06835
Bibcode: 2020arXiv200606835V
Keywords:
 Computer Science - Machine Learning;
 Mathematics - Optimization and Control;
 Statistics - Machine Learning