An SDE for Modeling SAM: Theory and Insights
Abstract
We study the SAM (Sharpness-Aware Minimization) optimizer which has recently attracted a lot of interest due to its increased performance over more classical variants of stochastic gradient descent. Our main contribution is the derivation of continuous-time models (in the form of SDEs) for SAM and two of its variants, both for the full-batch and mini-batch settings. We demonstrate that these SDEs are rigorous approximations of the real discrete-time algorithms (in a weak sense, scaling linearly with the learning rate). Using these models, we then offer an explanation of why SAM prefers flat minima over sharp ones~--~by showing that it minimizes an implicitly regularized loss with a Hessian-dependent noise structure. Finally, we prove that SAM is attracted to saddle points under some realistic conditions. Our theoretical results are supported by detailed experiments.
- Publication:
-
arXiv e-prints
- Pub Date:
- January 2023
- DOI:
- 10.48550/arXiv.2301.08203
- arXiv:
- arXiv:2301.08203
- Bibcode:
- 2023arXiv230108203M
- Keywords:
-
- Computer Science - Machine Learning;
- Mathematics - Optimization and Control
- E-Print:
- Accepted at ICML 2023 (Poster)