Multi-level Explanation of Deep Reinforcement Learning-based Scheduling

doi:10.48550/arXiv.2209.09645


Dependency-aware job scheduling in a cluster is NP-hard. Recent work shows that Deep Reinforcement Learning (DRL) is capable of solving it. However, a DRL-based policy is difficult for administrators to understand, even when it achieves remarkable performance gains. As a result, such a complex model-based scheduler struggles to gain trust in systems where simplicity is favored. In this paper, we present a multi-level explanation framework to interpret the policy of DRL-based scheduling. We dissect its decision-making process into the job level and the task level, and approximate each level with interpretable models and rules that align with operational practices. We show that the framework gives system administrators insight into the state-of-the-art scheduler and reveals a robustness issue in its behavior pattern.

Publication:

arXiv e-prints

Pub Date:

September 2022

DOI:

10.48550/arXiv.2209.09645

arXiv:

arXiv:2209.09645

Bibcode:

2022arXiv220909645Z

Keywords:

Computer Science - Distributed, Parallel, and Cluster Computing;
Computer Science - Artificial Intelligence;
Computer Science - Operating Systems

E-Print:

Accepted in the MLSys'22 Workshop on Cloud Intelligence / AIOps

