SparkJNI: A Toolchain for Hardware Accelerated Big Data Apache Spark

Tudor Alexandru Voicu; Zaid Al-Ars

doi:10.1109/ICBDA.2019.8713201

Computer Engineering

Research output: Chapter in Book/Conference proceedings/Edited volume › Conference contribution › Scientific › peer-review

4 Citations (Scopus)

Abstract

The JVM (Java virtual machine) is the cornerstone of most big data frameworks, providing automatic memory management and enabling high-productivity languages. Aside from the performance overhead induced by JVM languages (e.g., Java and Scala), big data frameworks, including Spark, also restrict code execution to general-purpose processors (CPUs), while HPC clusters readily include dedicated accelerators to achieve their high performance. In this paper, we analyze the state-of-the-art developments in the field of heterogeneously accelerated Spark, and we propose SparkJNI, a framework for JNI-accelerated Spark. The design provides two main components. First, it enables seamless use of native CPU code, in addition to integration of GPU and FPGA accelerators. Second, SparkJNI enables accelerated execution through native code integration by automatically generating C++ code wrappers, which simplifies development for the programmer. This makes it non-disruptive to the Java programmer while allowing great flexibility for native code development. Benchmark results show insignificant JNI-induced overhead in access time and bandwidth, with speedups of up to 12x for compute-intensive kernels (such as convolution) in comparison to pure Java Spark implementations. Last, a DNA analysis algorithm (Pair-HMM) is implemented in Spark and integrated with FPGAs, targeting cluster deployments, with benchmark results showing an overall speedup of ~2.7x over state-of-the-art CPU optimizations. The results of the presented work, along with the SparkJNI framework, are publicly available on GitHub for open-source usage and development.
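
The abstract names convolution as an example of a compute-intensive kernel that benefits from native execution. As an illustration only, the sketch below shows what such a native C++ kernel might look like. It is not taken from the SparkJNI code base: the function names `convolve1d` and `convolve1d_vec` are hypothetical, and the sketch assumes that the surrounding JNI glue code (not shown) has already exposed the Java array as a contiguous `float` buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical native kernel: "valid" 1D convolution without padding.
// The output holds input_len - kernel_len + 1 elements.
// It works on raw contiguous buffers, the form in which native code
// typically receives primitive array data from the JVM side.
void convolve1d(const float* input, std::size_t input_len,
                const float* kernel, std::size_t kernel_len,
                float* output) {
    assert(kernel_len > 0 && input_len >= kernel_len);
    const std::size_t output_len = input_len - kernel_len + 1;
    for (std::size_t i = 0; i < output_len; ++i) {
        float acc = 0.0f;
        for (std::size_t j = 0; j < kernel_len; ++j) {
            // The kernel is traversed in reverse, following the
            // mathematical definition of convolution.
            acc += input[i + j] * kernel[kernel_len - 1 - j];
        }
        output[i] = acc;
    }
}

// Convenience overload on std::vector for use from plain C++ code.
std::vector<float> convolve1d_vec(const std::vector<float>& input,
                                  const std::vector<float>& kernel) {
    std::vector<float> output(input.size() - kernel.size() + 1);
    convolve1d(input.data(), input.size(),
               kernel.data(), kernel.size(), output.data());
    return output;
}
```

Calling `convolve1d_vec({1, 2, 3, 4}, {1, 0, -1})` yields a two-element result. The raw-pointer signature is a deliberate choice in this sketch: it keeps the kernel independent of any JVM types, so the same function could be called from a generated wrapper or from a standalone test.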

Original language	English
Title of host publication	2019 4th IEEE International Conference on Big Data Analytics, ICBDA 2019
Editors	Sheng-Uei Guan, Kang Zhang, Jiannong Cao
Place of Publication	Piscataway, NJ, USA
Publisher	IEEE
Pages	152-157
Number of pages	6
ISBN (Electronic)	978-1-7281-1282-4
ISBN (Print)	978-1-7281-1283-1
DOIs	https://doi.org/10.1109/ICBDA.2019.8713201
Publication status	Published - 2019
Event	4th IEEE International Conference on Big Data Analytics, ICBDA 2019 - Suzhou, China Duration: 15 Mar 2019 → 18 Mar 2019

Conference

Conference	4th IEEE International Conference on Big Data Analytics, ICBDA 2019
Country/Territory	China
City	Suzhou
Period	15/03/19 → 18/03/19

Keywords

Big Data
Hardware Acceleration
Heterogeneous Architecture
JVM
Spark

Access to Document

10.1109/ICBDA.2019.8713201

Cite this

@inproceedings{5ec10b570557481babee9090521a1869,

title = "SparkJNI: A Toolchain for Hardware Accelerated Big Data Apache Spark",

abstract = "The JVM (Java virtual machine) is the cornerstone of most big data frameworks, providing automatic memory management and enabling high-productivity languages. Aside from the performance overhead induced by JVM languages (e.g., Java and Scala), big data frameworks, including Spark, also restrict code execution to general-purpose processors (CPUs), while HPC clusters readily include dedicated accelerators to achieve their high performance. In this paper, we analyze the state-of-the-art developments in the field of heterogeneously accelerated Spark, and we propose SparkJNI, a framework for JNI-accelerated Spark. The design provides two main components. First, it enables seamless use of native CPU code, in addition to integration of GPU and FPGA accelerators. Second, SparkJNI enables accelerated execution through native code integration by automatically generating C++ code wrappers, which simplifies development for the programmer. This makes it non-disruptive to the Java programmer while allowing great flexibility for native code development. Benchmark results show insignificant JNI-induced overhead in access time and bandwidth, with speedups of up to 12x for compute-intensive kernels (such as convolution) in comparison to pure Java Spark implementations. Last, a DNA analysis algorithm (Pair-HMM) is implemented in Spark and integrated with FPGAs, targeting cluster deployments, with benchmark results showing an overall speedup of $\sim$2.7x over state-of-the-art CPU optimizations. The results of the presented work, along with the SparkJNI framework, are publicly available on GitHub for open-source usage and development.",

keywords = "Big Data, Hardware Acceleration, Heterogeneous Architecture, JVM, Spark",

author = "Voicu, {Tudor Alexandru} and Zaid Al-Ars",

year = "2019",

doi = "10.1109/ICBDA.2019.8713201",

language = "English",

isbn = "978-1-7281-1283-1",

pages = "152--157",

editor = "Guan, {Sheng-Uei} and Zhang, {Kang} and Jiannong Cao",

booktitle = "2019 4th IEEE International Conference on Big Data Analytics, ICBDA 2019",

publisher = "IEEE",

address = "United States",

note = "4th IEEE International Conference on Big Data Analytics, ICBDA 2019 ; Conference date: 15-03-2019 Through 18-03-2019",

}

Voicu, TA & Al-Ars, Z 2019, SparkJNI: A Toolchain for Hardware Accelerated Big Data Apache Spark. in S-U Guan, K Zhang & J Cao (eds), 2019 4th IEEE International Conference on Big Data Analytics, ICBDA 2019, 8713201, IEEE, Piscataway, NJ, USA, pp. 152-157, 4th IEEE International Conference on Big Data Analytics, ICBDA 2019, Suzhou, China, 15/03/19. https://doi.org/10.1109/ICBDA.2019.8713201

SparkJNI: A Toolchain for Hardware Accelerated Big Data Apache Spark. / Voicu, Tudor Alexandru ; Al-Ars, Zaid.
2019 4th IEEE International Conference on Big Data Analytics, ICBDA 2019. ed. / Sheng-Uei Guan; Kang Zhang; Jiannong Cao. Piscataway, NJ, USA: IEEE, 2019. p. 152-157 8713201.

TY - GEN

T1 - SparkJNI

T2 - 4th IEEE International Conference on Big Data Analytics, ICBDA 2019

AU - Voicu, Tudor Alexandru

AU - Al-Ars, Zaid

PY - 2019

Y1 - 2019

AB - The JVM (Java virtual machine) is the cornerstone of most big data frameworks, providing automatic memory management and enabling high-productivity languages. Aside from the performance overhead induced by JVM languages (e.g., Java and Scala), big data frameworks, including Spark, also restrict code execution to general-purpose processors (CPUs), while HPC clusters readily include dedicated accelerators to achieve their high performance. In this paper, we analyze the state-of-the-art developments in the field of heterogeneously accelerated Spark, and we propose SparkJNI, a framework for JNI-accelerated Spark. The design provides two main components. First, it enables seamless use of native CPU code, in addition to integration of GPU and FPGA accelerators. Second, SparkJNI enables accelerated execution through native code integration by automatically generating C++ code wrappers, which simplifies development for the programmer. This makes it non-disruptive to the Java programmer while allowing great flexibility for native code development. Benchmark results show insignificant JNI-induced overhead in access time and bandwidth, with speedups of up to 12x for compute-intensive kernels (such as convolution) in comparison to pure Java Spark implementations. Last, a DNA analysis algorithm (Pair-HMM) is implemented in Spark and integrated with FPGAs, targeting cluster deployments, with benchmark results showing an overall speedup of ~2.7x over state-of-the-art CPU optimizations. The results of the presented work, along with the SparkJNI framework, are publicly available on GitHub for open-source usage and development.

KW - Big Data

KW - Hardware Acceleration

KW - Heterogeneous Architecture

KW - JVM

KW - Spark

UR - http://www.scopus.com/inward/record.url?scp=85066608307&partnerID=8YFLogxK

U2 - 10.1109/ICBDA.2019.8713201

DO - 10.1109/ICBDA.2019.8713201

M3 - Conference contribution

AN - SCOPUS:85066608307

SN - 978-1-7281-1283-1

SP - 152

EP - 157

BT - 2019 4th IEEE International Conference on Big Data Analytics, ICBDA 2019

A2 - Guan, Sheng-Uei

A2 - Zhang, Kang

A2 - Cao, Jiannong

PB - IEEE

CY - Piscataway, NJ, USA

Y2 - 15 March 2019 through 18 March 2019

ER -
