Dynamic multiple node failure recovery in distributed storage systems

Itani, May; Sharafeddine, Sanaa; ElKabani, Islam

URI: http://hdl.handle.net/10725/8022

URL: https://www.sciencedirect.com/science/article/pii/S1570870517302263

DOI: https://doi.org/10.1016/j.adhoc.2017.12.007

Date: 2018-06-07

Terms of Use: This item is made available under the terms and conditions applicable to "Article", as set forth at: http://libraries.lau.edu.lb/research/laur/terms-of-use/articles.php

Abstract:

Our daily lives are increasingly dependent on data centers and distributed storage systems in general, whether at the business or at the personal level. With the advent of fog computing, personal mobile devices in a given geographical area may also form a highly dynamic distributed storage system. These paradigm changes create an urgent need for efficient and reliable failure recovery mechanisms in dynamic scenarios where failures become more likely and nodes join and leave the network more frequently. Redundancy schemes in distributed storage systems have become essential for providing reliability, given how frequently nodes fail. In this work, we address the problem of multiple failure recovery in dynamic scenarios using the fractional repetition code as a redundancy scheme. The fractional repetition (FR) code is a class of regenerating codes that concatenates an outer maximum distance separable (MDS) code with an inner fractional repetition code, in which data is split into several blocks that are then replicated, with the replicas of each block stored on different system nodes. We formulate the problem as an integer linear programming problem and extend it to account for three dynamic scenarios: newly arriving blocks, newly arriving nodes, and allocation of blocks with variable priority. The contribution of this paper is four-fold: (i) we generate an optimized block distribution scheme that minimizes the total system repair cost over all dependent and independent multiple node failure scenarios; (ii) we address the practical scenario of newly arriving blocks and allocate those blocks to existing nodes without any modification to the original on-node block distribution; (iii) we consider newcomer nodes and generate an updated optimized block distribution; (iv) we consider optimized storage and recovery of blocks with varying priority using variable fractional repetition codes. The four problems are modeled using incidence matrices and solved heuristically. We present a range of results for our proposed algorithms in several scenarios to assess the effectiveness of the solution approaches, which are shown to generate results close to optimal.
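To make the incidence-matrix view concrete, the following is a minimal sketch rather than the authors' formulation. It assumes a hypothetical toy placement of 6 blocks on 4 nodes with each block replicated twice, and it uses an assumed repair measure (number of block copies to re-fetch from surviving nodes); the abstract does not specify the paper's actual cost model. The names `INCIDENCE` and `repair_summary` are illustrative only.

```python
# Hypothetical toy placement (an assumption, not taken from the paper):
# 4 nodes, 6 blocks, each block replicated on exactly 2 nodes.
# Rows are nodes, columns are blocks; 1 means the node stores the block.
INCIDENCE = [
    [1, 1, 1, 0, 0, 0],  # node 0
    [1, 0, 0, 1, 1, 0],  # node 1
    [0, 1, 0, 1, 0, 1],  # node 2
    [0, 0, 1, 0, 1, 1],  # node 3
]


def repair_summary(incidence, failed):
    """Summarize replica-based repair for a set of failed nodes.

    Returns (copies_to_fetch, helper_nodes, blocks_without_replica), where
    copies_to_fetch is the assumed repair cost: one unit per lost block copy
    that a surviving node can supply.
    """
    failed = set(failed)
    survivors = [n for n in range(len(incidence)) if n not in failed]
    copies_to_fetch = 0
    helpers = set()
    no_replica = []
    for block in range(len(incidence[0])):
        copies_missing = sum(incidence[n][block] for n in failed)
        if copies_missing == 0:
            continue  # this block is unaffected by the failure
        holders = [n for n in survivors if incidence[n][block]]
        if holders:
            copies_to_fetch += copies_missing
            helpers.add(holders[0])  # pick the first surviving holder
        else:
            # Every replica sat on a failed node; replication alone cannot
            # restore this block.
            no_replica.append(block)
    return copies_to_fetch, sorted(helpers), no_replica
```

For example, `repair_summary(INCIDENCE, [0])` reports three copies to fetch from nodes 1, 2, and 3, while `repair_summary(INCIDENCE, [0, 1])` reports four copies to fetch and flags block 0 as having no surviving replica. In an FR code, such a block would presumably have to be rebuilt through the outer MDS code, which this sketch does not model.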