Providing fault tolerance is of major importance for data analytics
frameworks such as Hadoop and Spark, which are typically
deployed in large clusters that are known to experience high failure
rates. Unexpected events such as compute-node failures are
a particularly important challenge for in-memory data analytics
frameworks, as the widely adopted approach for dealing with them is
to recompute work already done. Recomputing lost work, however,
requires the allocation of extra resources to re-execute tasks, thus increasing
job runtimes. To address this problem, we design a
checkpointing system called Panda that is tailored to the intrinsic
characteristics of data analytics frameworks. In particular, Panda
employs fine-grained checkpointing at the level of task outputs and
dynamically identifies tasks that are worthwhile to checkpoint
rather than to recompute. As has been abundantly shown, the tasks
of data analytics jobs may have highly variable runtimes and output
sizes. These properties form the basis of the three checkpointing
policies that we incorporate into Panda.
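The abstract does not detail the policies themselves, but the core idea of checkpointing only those tasks whose recomputation would cost more than saving their outputs can be illustrated with a small sketch. The sketch below is purely hypothetical and not taken from the paper: the class names, the assumed write bandwidth, and the simple runtime-versus-write-time comparison are illustrative assumptions.

  // Hypothetical sketch (not Panda's actual code): checkpoint a task's output
  // only when recomputing it is expected to cost more than writing it out.
  final case class TaskStats(runtimeSec: Double, outputBytes: Long)

  object CheckpointPolicy {
    // Assumed sustained write bandwidth to stable storage (illustrative value).
    val writeBandwidthBytesPerSec: Double = 100e6

    // Estimated time to write the task's output to stable storage.
    def checkpointCostSec(t: TaskStats): Double =
      t.outputBytes / writeBandwidthBytesPerSec

    // Checkpoint when recomputation (approximated by the task runtime) is costlier.
    def shouldCheckpoint(t: TaskStats): Boolean =
      t.runtimeSec > checkpointCostSec(t)
  }

  object Demo extends App {
    val longCheapOutput  = TaskStats(runtimeSec = 120.0, outputBytes = 50L * 1024 * 1024)
    val shortBulkyOutput = TaskStats(runtimeSec = 2.0,   outputBytes = 10L * 1024 * 1024 * 1024)
    println(CheckpointPolicy.shouldCheckpoint(longCheapOutput))  // true: slow to recompute, cheap to save
    println(CheckpointPolicy.shouldCheckpoint(shortBulkyOutput)) // false: cheap to recompute, costly to save
  }

A long-running task with a small output (the first example) is worth checkpointing, while a short task with a bulky output (the second) is cheaper to recompute; this mirrors the runtime and output-size trade-off the abstract describes.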
We first empirically evaluate Panda on a multicluster system
with single data analytics applications under space-correlated failures,
and find that Panda comes close to the performance of a failure-free
execution in unmodified Spark for a wide range of concurrent failures.
Then we perform simulations of complete workloads, mimicking
the size and operation of a Google cluster, and show that Panda
significantly improves the average job runtime over
wide ranges of the failure rate and the system load.
Original language: English
Title of host publication: Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing (HPDC 2017)
Publisher: ACM
Pages: 105-116
Number of pages: 12
ISBN (electronic): 978-1-4503-4699-3
State: Published - 2017

ID: 29539981