Edge of Stochastic Stability: Revisiting the Edge of Stability for SGD
Abstract
Recent findings by Cohen et al., 2021, demonstrate that when training neural networks with full-batch gradient descent at a step size of $\eta$, the sharpness--defined as the largest eigenvalue of the full batch Hessian--consistently stabilizes at $2/\eta$. These results have significant implications for convergence and generalization. Unfortunately, this was observed not to be the case for mini-batch stochastic gradient descent (SGD), thus limiting the broader applicability of these findings. We show that SGD trains in a different regime we call Edge of Stochastic Stability. In this regime, what hovers at $2/\eta$ is, instead, the average over the batches of the largest eigenvalue of the Hessian of the mini batch (MiniBS) loss--which is always bigger than the sharpness. This implies that the sharpness is generally lower when training with smaller batches or bigger learning rate, providing a basis for the observed implicit regularization effect of SGD towards flatter minima and a number of well established empirical phenomena. Additionally, we quantify the gap between the MiniBS and the sharpness, further characterizing this distinct training regime.
- Publication:
-
arXiv e-prints
- Pub Date:
- December 2024
- DOI:
- arXiv:
- arXiv:2412.20553
- Bibcode:
- 2024arXiv241220553A
- Keywords:
-
- Computer Science - Machine Learning;
- Mathematics - Optimization and Control;
- Statistics - Machine Learning
- E-Print:
- 28 pages, 24 figures