Failure detectors (or, more accurately Failure Suspectors (FS)) appear to be a fundamental service upon which to build fault-tolerant, distributed applications. This paper shows that a FS with very weak semantics (i.e., that delivers failure and recovery information in no specific order) suffices to implement virtually-synchronous communication (VSC) in an asynchronous system subject to process crash failures and network partitions. The VSC paradigm is particularly useful in asynchronous systems and greatly simplifies building fault-tolerant applications that mask failures by replicating processes. We suggest a three-component architecture to implement virtually-synchronous communication: (1) at the lowest level, the FS component; (2) on top of it, a component (2a) that defines new views; and (3) a component (2b) that reliably multicasts messages within a view. The issues covered in this paper also lead to a better understanding of the various membership service semantics proposed in recent literature.
NASA STI/Recon Technical Report N
- Pub Date:
- April 1993
- Fault Tolerance;
- Distributed Processing;
- Quality Assurance and Reliability