SparkGA2: Production-quality memory-efficient Apache Spark based genome analysis framework

Hamid Mushtaq; Nauman Ahmed; Zaid Al-Ars

doi:10.1371/journal.pone.0224784

Corresponding author: Zaid Al-Ars

Research output: Contribution to journal › Article › Scientific › peer-review

Abstract

Due to the rapid decrease in the cost of NGS (Next Generation Sequencing), interest has increased in using data generated from NGS to diagnose genetic diseases. However, the data generated by NGS technology is usually in the order of hundreds of gigabytes per experiment, thus requiring efficient and scalable programs to perform data analysis quickly. This paper presents SparkGA2, a memory efficient, production quality framework for high performance DNA analysis in the cloud, which can scale according to the available computational resources by increasing the number of nodes. Our framework uses Apache Spark's ability to cache data in the memory to speed up processing, while also allowing the user to run the framework on systems with lower amounts of memory at the cost of slightly less performance. To manage the memory footprint, we implement an on-the-fly compression method of intermediate data and reduce memory requirements by up to 3x. Our framework also uses a streaming approach to gradually stream input data as processing is taking place. This makes our framework faster than other state of the art approaches while at the same time allowing users to adapt it to run on clusters with lower memory. As compared to the state of the art, SparkGA2 is up to 22% faster on a large big data cluster of 67 nodes and up to 9% faster on a smaller cluster of 6 nodes. Including the streaming solution, where data pre-processing is considered, SparkGA2 is 51% faster on a 6 node cluster. The source code of SparkGA2 is publicly available at https://github.com/HamidMushtaq/SparkGA2.

Original language	English
Article number	e0224784
Pages (from-to)	1-14
Number of pages	14
Journal	PLoS ONE
Volume	14
Issue number	12
DOIs	https://doi.org/10.1371/journal.pone.0224784
Publication status	Published - 2019

Access to Document

10.1371/journal.pone.0224784

SparkGA2: Final published version, 1.44 MB, Licence: CC BY

Cite this

@article{8d2971a52a1e4affbc2c8008ee3270e9,

title = "{SparkGA2}: Production-quality memory-efficient {Apache Spark} based genome analysis framework",

abstract = "Due to the rapid decrease in the cost of NGS (Next Generation Sequencing), interest has increased in using data generated from NGS to diagnose genetic diseases. However, the data generated by NGS technology is usually in the order of hundreds of gigabytes per experiment, thus requiring efficient and scalable programs to perform data analysis quickly. This paper presents SparkGA2, a memory efficient, production quality framework for high performance DNA analysis in the cloud, which can scale according to the available computational resources by increasing the number of nodes. Our framework uses Apache Spark's ability to cache data in the memory to speed up processing, while also allowing the user to run the framework on systems with lower amounts of memory at the cost of slightly less performance. To manage the memory footprint, we implement an on-the-fly compression method of intermediate data and reduce memory requirements by up to 3x. Our framework also uses a streaming approach to gradually stream input data as processing is taking place. This makes our framework faster than other state of the art approaches while at the same time allowing users to adapt it to run on clusters with lower memory. As compared to the state of the art, SparkGA2 is up to 22% faster on a large big data cluster of 67 nodes and up to 9% faster on a smaller cluster of 6 nodes. Including the streaming solution, where data pre-processing is considered, SparkGA2 is 51% faster on a 6 node cluster. The source code of SparkGA2 is publicly available at https://github.com/HamidMushtaq/SparkGA2.",

author = "Hamid Mushtaq and Nauman Ahmed and Zaid Al-Ars",

year = "2019",

doi = "10.1371/journal.pone.0224784",

language = "English",

volume = "14",

pages = "1--14",

journal = "PLoS ONE",

issn = "1932-6203",

publisher = "Public Library of Science (PLOS)",

number = "12",

}

TY - JOUR

T1 - SparkGA2

T2 - Production-quality memory-efficient Apache Spark based genome analysis framework

AU - Mushtaq, Hamid

AU - Ahmed, Nauman

AU - Al-Ars, Zaid

PY - 2019

Y1 - 2019

AB - Due to the rapid decrease in the cost of NGS (Next Generation Sequencing), interest has increased in using data generated from NGS to diagnose genetic diseases. However, the data generated by NGS technology is usually in the order of hundreds of gigabytes per experiment, thus requiring efficient and scalable programs to perform data analysis quickly. This paper presents SparkGA2, a memory efficient, production quality framework for high performance DNA analysis in the cloud, which can scale according to the available computational resources by increasing the number of nodes. Our framework uses Apache Spark's ability to cache data in the memory to speed up processing, while also allowing the user to run the framework on systems with lower amounts of memory at the cost of slightly less performance. To manage the memory footprint, we implement an on-the-fly compression method of intermediate data and reduce memory requirements by up to 3x. Our framework also uses a streaming approach to gradually stream input data as processing is taking place. This makes our framework faster than other state of the art approaches while at the same time allowing users to adapt it to run on clusters with lower memory. As compared to the state of the art, SparkGA2 is up to 22% faster on a large big data cluster of 67 nodes and up to 9% faster on a smaller cluster of 6 nodes. Including the streaming solution, where data pre-processing is considered, SparkGA2 is 51% faster on a 6 node cluster. The source code of SparkGA2 is publicly available at https://github.com/HamidMushtaq/SparkGA2.

UR - http://www.scopus.com/inward/record.url?scp=85076191903&partnerID=8YFLogxK

U2 - 10.1371/journal.pone.0224784

DO - 10.1371/journal.pone.0224784

M3 - Article

C2 - 31805063

AN - SCOPUS:85076191903

SN - 1932-6203

VL - 14

SP - 1

EP - 14

JO - PLoS ONE

JF - PLoS ONE

IS - 12

M1 - e0224784

ER -
