Checkpointing and Localized Recovery for Nested Fork-Join Programs
Abstract
While checkpointing is typically combined with a restart of the whole application, localized recovery permits all but the affected processes to continue. In task-based cluster programming, for instance, the application can then be finished on the intact nodes, and the lost tasks be reassigned. This extended abstract suggests to adapt a checkpointing and localized recovery technique that has originally been developed for independent tasks to nested fork-join programs. We consider a Cilk-like work stealing scheme with work-first policy in a distributed memory setting, and describe the required algorithmic changes. The original technique has checkpointing overheads below 1% and neglectable costs for recovery, we expect the new algorithm to achieve a similar performance.
- Publication:
-
arXiv e-prints
- Pub Date:
- February 2021
- DOI:
- 10.48550/arXiv.2102.12941
- arXiv:
- arXiv:2102.12941
- Bibcode:
- 2021arXiv210212941F
- Keywords:
-
- Computer Science - Distributed;
- Parallel;
- and Cluster Computing