Providing fault-tolerance is of major importance for data analytics
frameworks such as Hadoop and Spark, which are typically
deployed in large clusters that are known to experience high failure
rates. Unexpected events such as compute node failures are
in particular an important challenge for in-memory data analytics
frameworks, as the widely adopted approach to deal with them is
to recompute work already done. Recomputing lost work, however,
requires allocating extra resources to re-execute tasks, thus increasing
job runtimes. To address this problem, we design a
checkpointing system called Panda that is tailored to the intrinsic
characteristics of data analytics frameworks. In particular, Panda
employs fine-grained checkpointing at the level of task outputs and
dynamically identifies tasks that are worth checkpointing
rather than recomputing. As has been abundantly shown, the tasks
of data analytics jobs may have highly variable runtimes and output
sizes. These properties form the basis of the three checkpointing
policies we incorporate into Panda.
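To make the trade-off concrete, a minimal sketch of a checkpoint-vs-recompute decision follows. This is a hypothetical illustration, not Panda's actual policy: the function `should_checkpoint` and all its parameters (task runtime, output size, checkpoint write bandwidth, estimated failure probability) are assumptions chosen to show why runtime and output size are the natural inputs to such a policy.

```python
# Hypothetical sketch (not Panda's actual code): checkpoint a task's output
# only when the expected cost of recomputing it after a failure exceeds
# the cost of writing the output to stable storage.

def should_checkpoint(runtime_s, output_bytes, bandwidth_bps, p_failure):
    """Return True if checkpointing this task is expected to pay off."""
    checkpoint_cost = output_bytes / bandwidth_bps   # time to persist the output
    expected_recompute = p_failure * runtime_s       # expected lost work on failure
    return expected_recompute > checkpoint_cost

# A long task with a small output is worth checkpointing...
print(should_checkpoint(runtime_s=600, output_bytes=10 * 2**20,
                        bandwidth_bps=100 * 2**20, p_failure=0.05))  # True
# ...while a short task with a large output is cheaper to recompute.
print(should_checkpoint(runtime_s=5, output_bytes=10 * 2**30,
                        bandwidth_bps=100 * 2**20, p_failure=0.05))  # False
```

Under this model, long-running tasks with small outputs are the prime checkpointing candidates, which matches the abstract's emphasis on variable task runtimes and output sizes.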
We first empirically evaluate Panda on a multicluster system
running single data analytics applications under space-correlated failures,
and find that Panda achieves performance close to that of a failure-free
execution in unmodified Spark across a large range of concurrent failures.
Then we perform simulations of complete workloads, mimicking
the size and operation of a Google cluster, and show that Panda
significantly improves the average job runtime over
wide ranges of failure rates and system loads.
Original language: English
Title of host publication: 26th International Symposium on High-Performance Parallel and Distributed Computing (HPDC)
Publisher: ACM
Number of pages: 116
ISBN (Electronic): 978-1-4503-4699-3
State: Published - 2017