Abstract
Next-Generation Sequencing (NGS) technology generates large amounts of data, usually on the order of hundreds of gigabytes per experiment, which must be analyzed quickly to produce meaningful variant results. The first step in analyzing such data is to map the sequenced reads to their corresponding positions in the human genome. One of the most popular tools for this sequence alignment is the Burrows-Wheeler Aligner (BWA mem). One limitation of the BWA program, however, is that it cannot be run on a cluster.
In this paper, we propose StreamBWA, a new framework that allows the BWA mem program to run on a cluster in a distributed fashion while the input data is still being streamed into the cluster. It can process the input data directly from a compressed file, located either on the local file system or at a URL. Moreover, StreamBWA can start combining the output files of the distributed BWA mem tasks while these tasks are still executing on the cluster. Empirical evaluation shows that this streaming distributed approach is approximately 2x faster than the non-streaming approach. Furthermore, our streaming distributed approach is 5x faster than other state-of-the-art solutions such as SparkBWA. The source code of StreamBWA is publicly available at https://github.com/HamidMushtaq/StreamBWA.
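The overlap described in the abstract (reading compressed input, aligning chunks, and combining outputs all at once) can be illustrated with a minimal single-machine sketch. This is not StreamBWA's implementation, which targets a cluster. The names `open_reads`, `fastq_chunks`, `align_chunk`, and `stream_align` are hypothetical, and `align_chunk` is only a stand-in for a `bwa mem` task that reports each read's name and length.

```python
import gzip
import io
import urllib.request
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def open_reads(source):
    """Open gzip-compressed FASTQ text from a local path or a URL (hypothetical helper)."""
    if source.startswith(("http://", "https://", "ftp://")):
        raw = urllib.request.urlopen(source)
    else:
        raw = open(source, "rb")
    return gzip.open(raw, "rt")


def fastq_chunks(stream, reads_per_chunk):
    """Yield chunks of FASTQ lines (4 lines per read) as they arrive."""
    while True:
        lines = list(islice(stream, 4 * reads_per_chunk))
        if not lines:
            return
        yield lines


def align_chunk(lines):
    """Stand-in for a `bwa mem` task: returns 'name<TAB>read length' per read."""
    out = []
    for i in range(0, len(lines), 4):
        name = lines[i].rstrip("\n")[1:]
        seq = lines[i + 1].rstrip("\n")
        out.append(f"{name}\t{len(seq)}")
    return out


def stream_align(stream, reads_per_chunk=2, workers=4):
    """Submit chunks while input is still being read; yield results in chunk order."""
    pending = deque()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for chunk in fastq_chunks(stream, reads_per_chunk):
            pending.append(pool.submit(align_chunk, chunk))
            # Combine finished head-of-line results while later chunks
            # are still being read or aligned; also bound the backlog.
            while pending and (pending[0].done() or len(pending) > workers):
                yield from pending.popleft().result()
        # Input is exhausted: drain the remaining tasks in order.
        while pending:
            yield from pending.popleft().result()


if __name__ == "__main__":
    # Build a tiny in-memory gzip FASTQ instead of reading a real file.
    text = "".join(f"@read{i}\nACGT\n+\nIIII\n" for i in range(5))
    with gzip.open(io.BytesIO(gzip.compress(text.encode())), "rt") as reads:
        for line in stream_align(reads, reads_per_chunk=2, workers=2):
            print(line)
```

Popping results strictly from the head of the queue keeps the combined output in input order, at the cost of occasionally waiting on a slow early chunk. A real deployment would replace the thread pool with cluster tasks and `align_chunk` with an actual aligner invocation.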
Original language | English |
---|---|
Title of host publication | 2017 IEEE 17th International Conference on BioInformatics and BioEngineering (BIBE) |
Place of Publication | Piscataway |
Publisher | IEEE |
Pages | 188-193 |
Number of pages | 6 |
ISBN (Electronic) | 978-1-5386-1324-5 |
ISBN (Print) | 978-1-5386-1325-2 |
DOIs | |
Publication status | Published - 2017 |
Event | BIBE 2017: 17th IEEE International Conference on BioInformatics and BioEngineering, Washington DC, United States. Duration: 23 Oct 2017 → 25 Oct 2017. http://bibe2017.com/index.html |
Conference
Conference | BIBE 2017 |
---|---|
Abbreviated title | BIBE 2017 |
Country/Territory | United States |
City | Washington DC |
Period | 23/10/17 → 25/10/17 |
Internet address |
Keywords
- DNA
- Micromechanical devices
- Pipelines
- Tools
- Sparks
- Big Data