A Reliability Model for Dependent and Distributed MDS Disk Array Units
Abstract
Archiving and systematic backup of large volumes of digital data create a rapidly growing demand for multi-petabyte-scale storage systems. As drive capacities grow beyond the few-terabyte range to meet the demands of today's cloud, multiple simultaneous disk failures become a realistic possibility. Among the main factors causing catastrophic system failures, correlated disk failures and network bandwidth are reported to be two common sources of performance degradation. The emerging trend is to use sophisticated erasure codes (EC) with multiple parities and efficient repair in order to meet reliability and bandwidth requirements. It is known that the mean time to failure and repair rates reported by disk manufacturers cannot capture the life-cycle patterns of distributed storage systems. In this study, we develop failure models based on generalized Markov chains that can accurately capture correlated performance degradations with multi-parity protection schemes based on modern Maximum Distance Separable (MDS) EC. Furthermore, we use the proposed model in a distributed storage scenario to quantify two example use cases: first, the common-sense expectation that adding more parity disks is only meaningful if the failure domains of the storage systems are sufficiently decorrelated; and second, the reliability of generic multiple single-dimensional EC-protected storage systems.
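For orientation, the kind of Markov-chain reliability calculation the abstract refers to can be sketched with the textbook baseline: a birth-death chain for an (n, k) MDS array with independent, exponentially distributed failures and a single repair process. This is not the generalized, correlation-aware model developed in the paper. The function name `mttdl_mds` and the parameters `lam` (per-disk failure rate) and `mu` (repair rate) are assumptions made for illustration only.

```python
def mttdl_mds(n, k, lam, mu):
    """Mean time to data loss (MTTDL) of an (n, k) MDS-protected array.

    Baseline assumptions (NOT the paper's generalized model):
      - disks fail independently at rate `lam` each,
      - one repair at a time completes at rate `mu`,
      - data is lost when more than n - k disks are failed at once.
    """
    if not 0 < k <= n:
        raise ValueError("require 0 < k <= n")
    if lam <= 0 or mu < 0:
        raise ValueError("require lam > 0 and mu >= 0")

    parities = n - k
    total = 0.0
    h_prev = 0.0  # expected first-passage time from state i-1 to state i
    for failed in range(parities + 1):
        fail_rate = (n - failed) * lam          # any surviving disk may fail
        repair_rate = mu if failed > 0 else 0.0  # nothing to repair in state 0
        # First-passage recursion for a birth-death chain:
        # h_i = 1/lambda_i + (mu_i / lambda_i) * h_{i-1}
        h = 1.0 / fail_rate + (repair_rate / fail_rate) * h_prev
        total += h
        h_prev = h
    return total


if __name__ == "__main__":
    # Hypothetical rates in events per hour, chosen only for illustration.
    lam, mu = 1e-5, 0.1
    for k in (10, 9, 8):
        print(f"(10, {k}) MDS array: MTTDL = {mttdl_mds(10, k, lam, mu):.3e} hours")
```

With a single parity (k = n - 1) the recursion reduces to the classical closed form ((2n - 1)λ + μ) / (n(n - 1)λ²). Because every failure here is independent, each extra parity increases the MTTDL substantially; this is exactly the assumption the abstract says breaks down when failure domains are correlated.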
- Publication:
- arXiv e-prints
- Pub Date:
- October 2018
- DOI:
- arXiv:
- arXiv:1810.10621
- Bibcode:
- 2018arXiv181010621A
- Keywords:
- Computer Science - Information Theory
- E-Print:
- This paper has been accepted for publication in IEEE Transactions on Reliability, Oct. 2018 (unedited version).