Bandwidth-based Step-Sizes for Non-Convex Stochastic Optimization
Abstract
Many popular learning-rate schedules for deep neural networks combine a decaying trend with local perturbations that attempt to escape saddle points and bad local minima. We derive convergence guarantees for bandwidth-based step-sizes, a general class of learning rates that are allowed to vary in a banded region. This framework includes many popular cyclic and non-monotonic step-sizes for which no theoretical guarantees were previously known. We provide worst-case guarantees for SGD on smooth non-convex problems under several bandwidth-based step sizes, including stagewise $1/\sqrt{t}$ and the popular step-decay (constant and then drop by a constant), which is also shown to be optimal. Moreover, we show that its momentum variant converges as fast as SGD with the bandwidth-based step-decay step-size. Finally, we propose novel step-size schemes in the bandwidth-based family and verify their efficiency on several deep neural network training tasks.
- Publication:
-
arXiv e-prints
- Pub Date:
- June 2021
- DOI:
- 10.48550/arXiv.2106.02888
- arXiv:
- arXiv:2106.02888
- Bibcode:
- 2021arXiv210602888W
- Keywords:
-
- Computer Science - Machine Learning;
- Mathematics - Optimization and Control