Standard

Better Safe than Sorry: Grappling with Failures of In-Memory Data Analytics Frameworks. / Epema, Dick; Ghit, Bogdan.

26th Int'l Symp. on High-Performance Parallel and Distributed Computing (HPDC). ACM DL, 2017. p. 105.

Research output: Scientific - peer-review › Conference contribution

Harvard

Epema, D & Ghit, B 2017, Better Safe than Sorry: Grappling with Failures of In-Memory Data Analytics Frameworks. in 26th Int'l Symp. on High-Performance Parallel and Distributed Computing (HPDC). ACM DL, p. 105. DOI: 10.1145/3078597.3078600

APA

Epema, D., & Ghit, B. (2017). Better Safe than Sorry: Grappling with Failures of In-Memory Data Analytics Frameworks. In 26th Int'l Symp. on High-Performance Parallel and Distributed Computing (HPDC) (p. 105). ACM DL. DOI: 10.1145/3078597.3078600

Vancouver

Epema D, Ghit B. Better Safe than Sorry: Grappling with Failures of In-Memory Data Analytics Frameworks. In 26th Int'l Symp. on High-Performance Parallel and Distributed Computing (HPDC). ACM DL. 2017. p. 105. DOI: 10.1145/3078597.3078600

Author

Epema, Dick; Ghit, Bogdan / Better Safe than Sorry: Grappling with Failures of In-Memory Data Analytics Frameworks.

26th Int'l Symp. on High-Performance Parallel and Distributed Computing (HPDC). ACM DL, 2017. p. 105.

Research output: Scientific - peer-review › Conference contribution

BibTeX

@inproceedings{dc7e56568ac149ed8e50024516516d7c,
title = "Better Safe than Sorry: Grappling with Failures of In-Memory Data Analytics Frameworks",
author = "Dick Epema and Bogdan Ghit",
year = "2017",
doi = "10.1145/3078597.3078600",
pages = "105",
booktitle = "26th Int'l Symp. on High-Performance Parallel and Distributed Computing (HPDC)",
publisher = "ACM DL",
}

RIS

TY - CONF

T1 - Better Safe than Sorry: Grappling with Failures of In-Memory Data Analytics Frameworks

AU - Epema, Dick

AU - Ghit, Bogdan

PY - 2017

Y1 - 2017

N2 - Providing fault-tolerance is of major importance for data analytics frameworks such as Hadoop and Spark, which are typically deployed in large clusters that are known to experience high failure rates. Unexpected events such as compute node failures are in particular an important challenge for in-memory data analytics frameworks, as the widely adopted approach to deal with them is to recompute work already done. Recomputing lost work, however, requires allocation of extra resources to re-execute tasks, thus increasing the job runtimes. To address this problem, we design a checkpointing system called Panda that is tailored to the intrinsic characteristics of data analytics frameworks. In particular, Panda employs fine-grained checkpointing at the level of task outputs and dynamically identifies tasks that are worthwhile to be checkpointed rather than be recomputed. As has been abundantly shown, tasks of data analytics jobs may have very variable runtimes and output sizes. These properties form the basis of three checkpointing policies which we incorporate into Panda. We first empirically evaluate Panda on a multicluster system with single data analytics applications under space-correlated failures, and find that Panda is close to the performance of a fail-free execution in unmodified Spark for a large range of concurrent failures. Then we perform simulations of complete workloads, mimicking the size and operation of a Google cluster, and show that Panda provides significant improvements in the average job runtime for wide ranges of the failure rate and system load.

AB - Providing fault-tolerance is of major importance for data analytics frameworks such as Hadoop and Spark, which are typically deployed in large clusters that are known to experience high failure rates. Unexpected events such as compute node failures are in particular an important challenge for in-memory data analytics frameworks, as the widely adopted approach to deal with them is to recompute work already done. Recomputing lost work, however, requires allocation of extra resources to re-execute tasks, thus increasing the job runtimes. To address this problem, we design a checkpointing system called Panda that is tailored to the intrinsic characteristics of data analytics frameworks. In particular, Panda employs fine-grained checkpointing at the level of task outputs and dynamically identifies tasks that are worthwhile to be checkpointed rather than be recomputed. As has been abundantly shown, tasks of data analytics jobs may have very variable runtimes and output sizes. These properties form the basis of three checkpointing policies which we incorporate into Panda. We first empirically evaluate Panda on a multicluster system with single data analytics applications under space-correlated failures, and find that Panda is close to the performance of a fail-free execution in unmodified Spark for a large range of concurrent failures. Then we perform simulations of complete workloads, mimicking the size and operation of a Google cluster, and show that Panda provides significant improvements in the average job runtime for wide ranges of the failure rate and system load.

U2 - 10.1145/3078597.3078600

DO - 10.1145/3078597.3078600

M3 - Conference contribution

SP - 105

BT - 26th Int'l Symp. on High-Performance Parallel and Distributed Computing (HPDC)

PB - ACM DL

ER -

ID: 29539981
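
The abstract describes Panda's core trade-off: a task's output is worth checkpointing when persisting it is expected to cost less than recomputing the task after a failure, with task runtime and output size driving the decision. The Python sketch below illustrates one possible decision rule of that kind; the class, function, cost model, and constants are illustrative assumptions and do not reproduce the paper's actual three policies.

# Hypothetical checkpoint-or-recompute decision in the spirit of the
# abstract: persist a task's output when the expected cost of
# recomputing it after a failure exceeds the one-time cost of writing
# it to stable storage. All names and numbers here are assumptions.

from dataclasses import dataclass

@dataclass
class Task:
    runtime_s: float   # observed task runtime (seconds)
    output_mb: float   # size of the task's output (MB)

def should_checkpoint(task: Task,
                      write_mb_per_s: float = 100.0,
                      failure_prob: float = 0.05) -> bool:
    """Checkpoint iff the expected recomputation cost exceeds the
    one-time cost of persisting the task's output."""
    checkpoint_cost_s = task.output_mb / write_mb_per_s   # time to write output
    expected_recompute_s = failure_prob * task.runtime_s  # expected time lost to redo
    return expected_recompute_s > checkpoint_cost_s

if __name__ == "__main__":
    long_small = Task(runtime_s=600.0, output_mb=50.0)    # long task, small output
    short_large = Task(runtime_s=5.0, output_mb=2000.0)   # short task, large output
    print(should_checkpoint(long_small))   # True: cheap to save, costly to redo
    print(should_checkpoint(short_large))  # False: costly to save, cheap to redo

The two example tasks capture the intuition the abstract appeals to: long-running tasks with small outputs favor checkpointing, while short tasks with large outputs favor recomputation.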