Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD Variants

doi:10.48550/arXiv.2502.02431

Recent advancements in deep learning optimization have introduced new algorithms, such as Schedule-Free optimizers, AdEMAMix, MARS, and Lion, which modify traditional momentum mechanisms. In a separate line of work, theoretical acceleration of stochastic gradient descent (SGD) in the noise-dominated regime has been achieved by decoupling the momentum coefficient from the current gradient's weight. In this paper, we establish explicit connections between these two lines of work. We substantiate our theoretical findings with preliminary experiments on a 150M language modeling task. We find that AdEMAMix, which most closely resembles accelerated versions of stochastic gradient descent, exhibits superior performance. Building on these insights, we introduce a modification to AdEMAMix, termed Simplified-AdEMAMix, which maintains the same performance as AdEMAMix across both large and small batch-size settings while eliminating the need for two different momentum terms. The code for Simplified-AdEMAMix is available at https://github.com/DepenM/Simplified-AdEMAMix/.
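As a rough illustration of what "decoupling the momentum coefficient from the current gradient's weight" can mean, the sketch below keeps a momentum buffer with coefficient `beta` and adds the current gradient with its own weight `alpha`. This is an assumed, simplified form written for this note, not the update rule of AdEMAMix, Simplified-AdEMAMix, or any other optimizer from the paper; the function names, default values, and the toy quadratic objective are all hypothetical.

```python
def decoupled_momentum_step(theta, m, grad, lr=0.1, beta=0.9, alpha=0.0):
    """One scalar SGD step with a momentum buffer plus a separately
    weighted current-gradient term (hypothetical illustrative form)."""
    # Momentum buffer: exponential moving average with coefficient beta.
    m = beta * m + (1.0 - beta) * grad
    # alpha weights the current gradient independently of beta;
    # alpha = 0 recovers plain EMA-momentum SGD.
    update = m + alpha * grad
    theta = theta - lr * update
    return theta, m


def minimize_quadratic(steps=500, **kwargs):
    """Run the step on f(theta) = 0.5 * theta**2, whose gradient is theta,
    starting from theta = 1 with an empty momentum buffer."""
    theta, m = 1.0, 0.0
    for _ in range(steps):
        theta, m = decoupled_momentum_step(theta, m, grad=theta, **kwargs)
    return theta
```

Calling `minimize_quadratic(alpha=0.0)` and `minimize_quadratic(alpha=0.5)` runs the same momentum buffer with and without the extra gradient term, which makes the role of the second knob easy to inspect on a toy problem. The repository linked above is the place to look for the authors' actual implementation.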

Publication: arXiv e-prints

Pub Date: February 2025

DOI: 10.48550/arXiv.2502.02431

arXiv: arXiv:2502.02431

Bibcode: 2025arXiv250202431M

Keywords: Computer Science - Machine Learning; Computer Science - Artificial Intelligence
