HyperSched: Dynamic Resource Reallocation for Model Development on a Deadline
Abstract
Prior research in resource scheduling for machine learning training workloads has largely focused on minimizing job completion times. Commonly, these model training workloads collectively search over a large number of parameter values that control the learning process in a hyperparameter search. It is preferable to identify and maximally provision the best-performing hyperparameter configuration (trial) to achieve the highest accuracy result as soon as possible. To optimally trade-off evaluating multiple configurations and training the most promising ones by a fixed deadline, we design and build HyperSched -- a dynamic application-level resource scheduler to track, identify, and preferentially allocate resources to the best performing trials to maximize accuracy by the deadline. HyperSched leverages three properties of a hyperparameter search workload over-looked in prior work - trial disposability, progressively identifiable rankings among different configurations, and space-time constraints - to outperform standard hyperparameter search algorithms across a variety of benchmarks.
- Publication:
-
arXiv e-prints
- Pub Date:
- January 2020
- DOI:
- 10.48550/arXiv.2001.02338
- arXiv:
- arXiv:2001.02338
- Bibcode:
- 2020arXiv200102338L
- Keywords:
-
- Computer Science - Distributed;
- Parallel;
- and Cluster Computing;
- Computer Science - Machine Learning