Checkpointing and Localized Recovery for Nested Fork-Join Programs

doi:10.48550/arXiv.2102.12941

Fohry, Claudia

While checkpointing is typically combined with a restart of the whole application, localized recovery permits all but the affected processes to continue. In task-based cluster programming, for instance, the application can then be finished on the intact nodes, and the lost tasks can be reassigned. This extended abstract proposes adapting a checkpointing and localized recovery technique, originally developed for independent tasks, to nested fork-join programs. We consider a Cilk-like work stealing scheme with a work-first policy in a distributed-memory setting and describe the required algorithmic changes. The original technique has checkpointing overheads below 1% and negligible recovery costs; we expect the new algorithm to achieve similar performance.

Publication:

arXiv e-prints

Pub Date:

February 2021

DOI:

10.48550/arXiv.2102.12941

arXiv:

arXiv:2102.12941

Bibcode:

2021arXiv210212941F

Keywords:

Computer Science - Distributed, Parallel, and Cluster Computing
