Edge of Stochastic Stability: Revisiting the Edge of Stability for SGD

doi:10.48550/arXiv.2412.20553

Cohen et al. (2021) showed that when neural networks are trained with full-batch gradient descent at step size $\eta$, the sharpness (defined as the largest eigenvalue of the full-batch Hessian) consistently stabilizes at $2/\eta$. These results have significant implications for convergence and generalization. However, the same behavior was not observed for mini-batch stochastic gradient descent (SGD), which limits the broader applicability of these findings. We show that SGD trains in a different regime, which we call the Edge of Stochastic Stability. In this regime, the quantity that hovers at $2/\eta$ is instead the average, over batches, of the largest eigenvalue of the Hessian of the mini-batch loss (MiniBS), which is always larger than the sharpness. This implies that the sharpness is generally lower when training with smaller batches or larger learning rates, providing a basis for the observed implicit regularization effect of SGD towards flatter minima and for a number of well-established empirical phenomena. Additionally, we quantify the gap between the MiniBS and the sharpness, further characterizing this distinct training regime.
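The ordering between the two quantities in the abstract can be checked on a toy problem. The sketch below is not the paper's experiment: it assumes synthetic per-sample Hessians drawn as random symmetric 2x2 matrices, and it averages over disjoint consecutive batches rather than over randomly sampled ones. The helper names (`lambda_max`, `mean_hessian`, `sharpness_and_minibs`) are hypothetical. It illustrates only that the batch-averaged largest eigenvalue is never below the largest eigenvalue of the full-batch Hessian, and that the gap shrinks as the batch grows.

```python
import math
import random


def lambda_max(h):
    """Largest eigenvalue of the symmetric 2x2 matrix [[a, b], [b, c]]."""
    a, b, c = h
    return (a + c) / 2 + math.hypot((a - c) / 2, b)


def mean_hessian(hessians):
    """Entrywise average of matrices stored as (a, b, c) tuples."""
    n = len(hessians)
    return tuple(sum(h[i] for h in hessians) / n for i in range(3))


def sharpness_and_minibs(per_sample, batch_size):
    """Return (full-batch sharpness, batch-averaged largest eigenvalue).

    Assumption: batches are disjoint consecutive slices of equal size,
    a simplification of averaging over randomly drawn mini-batches.
    """
    batches = [per_sample[i:i + batch_size]
               for i in range(0, len(per_sample), batch_size)]
    sharpness = lambda_max(mean_hessian(per_sample))
    minibs = sum(lambda_max(mean_hessian(b)) for b in batches) / len(batches)
    return sharpness, minibs


# Synthetic per-sample Hessians (illustrative only, not from a real network).
rng = random.Random(0)
per_sample = [(rng.gauss(1, 1), rng.gauss(0, 1), rng.gauss(1, 1))
              for _ in range(64)]

results = {}
for batch_size in (1, 4, 16, 64):
    sharpness, minibs = sharpness_and_minibs(per_sample, batch_size)
    results[batch_size] = (sharpness, minibs)
    print(f"batch={batch_size:3d}  sharpness={sharpness:.4f}  "
          f"MiniBS={minibs:.4f}  gap={minibs - sharpness:.4f}")
```

The inequality follows from the convexity of the largest eigenvalue as a function of a symmetric matrix. Read in the abstract's terms: if the batch-averaged quantity is pinned near $2/\eta$, the sharpness sits below $2/\eta$ by the gap, and in this toy setting the gap is largest for the smallest batches.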

Publication:

arXiv e-prints

Pub Date:

December 2024

DOI:

10.48550/arXiv.2412.20553

arXiv:

arXiv:2412.20553

Bibcode:

2024arXiv241220553A

Keywords:

Computer Science - Machine Learning;
Mathematics - Optimization and Control;
Statistics - Machine Learning

E-Print:

28 pages, 24 figures
