SparkRA: Enabling big data scalability for the GATK RNA-seq pipeline with Apache Spark

Zaid Al-Ars; Saiyi Wang; Hamid Mushtaq

doi:10.3390/genes11010053

Zaid Al-Ars^*, Saiyi Wang, Hamid Mushtaq

^*Corresponding author for this work

Research output: Contribution to journal › Article › Scientific › peer-review

Abstract

The rapid proliferation of low-cost RNA-seq data has resulted in a growing interest in RNA analysis techniques for various applications, ranging from identifying genotype–phenotype relationships to validating discoveries of other analysis results. However, many practical applications in this field are limited by the available computational resources and the associated long computing time needed to perform the analysis. GATK has a popular best practices pipeline specifically designed for variant calling in RNA-seq analysis. Some tools in this pipeline are not optimized to scale the analysis to multiple processors or compute nodes efficiently, thereby limiting their ability to process large datasets. In this paper, we present SparkRA, an Apache Spark-based pipeline that efficiently scales the GATK RNA-seq variant calling pipeline across multiple cores in one node or in a large cluster. On a single node with 20 hyper-threaded cores, the original pipeline runs for more than 5 h to process a dataset of 32 GB. In contrast, SparkRA reduces the overall computation time on the same node by about 4×, down to 1.3 h. On a cluster with 16 nodes (each with eight single-threaded cores), SparkRA further reduces this computation time by 7.7× compared to a single node. Compared to other scalable state-of-the-art solutions, SparkRA is 1.2× faster while achieving the same accuracy of the results.
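The timing figures quoted in the abstract can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, the "more than 5 h" figure is treated as a lower bound of exactly 5 h, and it assumes the 7.7× cluster speedup is measured against the 1.3 h single-node SparkRA run, which the abstract does not state explicitly.

```python
# Figures quoted in the abstract; they are rounded there, so results are approximate.
ORIGINAL_HOURS = 5.0       # "more than 5 h": treated here as a lower bound (assumption)
SPARKRA_NODE_HOURS = 1.3   # SparkRA on the same single node
CLUSTER_SPEEDUP = 7.7      # 16-node cluster relative to a single node


def single_node_speedup(original_hours: float, sparkra_hours: float) -> float:
    """Ratio of the original runtime to the SparkRA runtime on one node."""
    return original_hours / sparkra_hours


def implied_cluster_minutes(node_hours: float, cluster_speedup: float) -> float:
    """Cluster runtime in minutes, ASSUMING the speedup baseline is node_hours."""
    return node_hours / cluster_speedup * 60.0


if __name__ == "__main__":
    speedup = single_node_speedup(ORIGINAL_HOURS, SPARKRA_NODE_HOURS)
    minutes = implied_cluster_minutes(SPARKRA_NODE_HOURS, CLUSTER_SPEEDUP)
    print(f"Single-node speedup: at least {speedup:.2f}x")
    print(f"Implied 16-node runtime: about {minutes:.1f} min")
```

Under these assumptions the single-node ratio comes out slightly below 4, which is consistent with "about 4×" given that the original runtime is only bounded from below.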

Original language	English
Article number	53
Journal	Genes
Volume	11
Issue number	1
DOIs	https://doi.org/10.3390/genes11010053
Publication status	Published - 1 Jan 2020

Keywords

Apache Spark
Computation time
GATK variant calling
RNA-seq
Scalability

Access to Document

10.3390/genes11010053

genes-11-00053: Final published version, 941 KB, Licence: CC BY

Cite this

@article{69a11ae5e40141d99ffbff8560e71666,

title = "{SparkRA}: Enabling big data scalability for the {GATK} {RNA}-seq pipeline with {Apache} {Spark}",

abstract = "The rapid proliferation of low-cost RNA-seq data has resulted in a growing interest in RNA analysis techniques for various applications, ranging from identifying genotype–phenotype relationships to validating discoveries of other analysis results. However, many practical applications in this field are limited by the available computational resources and associated long computing time needed to perform the analysis. GATK has a popular best practices pipeline specifically designed for variant calling RNA-seq analysis. Some tools in this pipeline are not optimized to scale the analysis to multiple processors or compute nodes efficiently, thereby limiting their ability to process large datasets. In this paper, we present SparkRA, an Apache Spark based pipeline to efficiently scale up the GATK RNA-seq variant calling pipeline on multiple cores in one node or in a large cluster. On a single node with 20 hyper-threaded cores, the original pipeline runs for more than 5 h to process a dataset of 32 GB. In contrast, SparkRA is able to reduce the overall computation time of the pipeline on the same single node by about 4×, reducing the computation time down to 1.3 h. On a cluster with 16 nodes (each with eight single-threaded cores), SparkRA is able to further reduce this computation time by 7.7× compared to a single node. Compared to other scalable state-of-the-art solutions, SparkRA is 1.2× faster while achieving the same accuracy of the results.",

keywords = "Apache Spark, Computation time, GATK variant calling, RNA-seq, Scalability",

author = "Zaid Al-Ars and Saiyi Wang and Hamid Mushtaq",

year = "2020",

month = jan,

day = "1",

doi = "10.3390/genes11010053",

language = "English",

volume = "11",

journal = "Genes",

issn = "2073-4425",

publisher = "MDPI",

number = "1",

}

TY - JOUR

T1 - SparkRA

T2 - Enabling big data scalability for the GATK RNA-seq pipeline with Apache Spark

AU - Al-Ars, Zaid

AU - Wang, Saiyi

AU - Mushtaq, Hamid

PY - 2020/1/1

Y1 - 2020/1/1

N2 - The rapid proliferation of low-cost RNA-seq data has resulted in a growing interest in RNA analysis techniques for various applications, ranging from identifying genotype–phenotype relationships to validating discoveries of other analysis results. However, many practical applications in this field are limited by the available computational resources and associated long computing time needed to perform the analysis. GATK has a popular best practices pipeline specifically designed for variant calling RNA-seq analysis. Some tools in this pipeline are not optimized to scale the analysis to multiple processors or compute nodes efficiently, thereby limiting their ability to process large datasets. In this paper, we present SparkRA, an Apache Spark based pipeline to efficiently scale up the GATK RNA-seq variant calling pipeline on multiple cores in one node or in a large cluster. On a single node with 20 hyper-threaded cores, the original pipeline runs for more than 5 h to process a dataset of 32 GB. In contrast, SparkRA is able to reduce the overall computation time of the pipeline on the same single node by about 4×, reducing the computation time down to 1.3 h. On a cluster with 16 nodes (each with eight single-threaded cores), SparkRA is able to further reduce this computation time by 7.7× compared to a single node. Compared to other scalable state-of-the-art solutions, SparkRA is 1.2× faster while achieving the same accuracy of the results.

KW - Apache Spark

KW - Computation time

KW - GATK variant calling

KW - RNA-seq

KW - Scalability

UR - http://www.scopus.com/inward/record.url?scp=85077701381&partnerID=8YFLogxK

U2 - 10.3390/genes11010053

DO - 10.3390/genes11010053

M3 - Article

AN - SCOPUS:85077701381

SN - 2073-4425

VL - 11

JO - Genes

JF - Genes

IS - 1

M1 - 53

ER -
