TY - GEN
T1 - Evaluating POWER Architecture for Distributed Training of Generative Adversarial Networks
AU - Hesam, Ahmad
AU - Vallecorsa, Sofia
AU - Khattak, Gulrukh
AU - Carminati, Federico
PY - 2019
Y1 - 2019
AB - The increased availability of High-Performance Computing resources enables data scientists to deploy and evaluate data-driven approaches, notably in the field of deep learning, at a rapid pace. As deep neural networks become more complex and ingest increasingly larger datasets, it becomes impractical to perform the training phase on single machine instances due to memory constraints and extremely long training times. Rather than scaling up, scaling out the computing resources is a productive approach to improving performance. The paradigm of data parallelism allows us to split the training dataset into manageable chunks that can be processed in parallel. In this work, we evaluate the scaling performance of training a 3D generative adversarial network (GAN) on an IBM POWER8 cluster equipped with 12 NVIDIA P100 GPUs. The full training duration of the GAN, including evaluation, is reduced from 20 h and 16 min on a single GPU to 2 h and 14 min on all 12 GPUs. We achieve a scaling efficiency of 98.9% when scaling from 1 to 12 GPUs, taking only the training process into consideration.
KW - Distributed training
KW - Generative adversarial network
KW - GPU
KW - High Performance Computing
KW - POWER8
UR - http://www.scopus.com/inward/record.url?scp=85076865826&partnerID=8YFLogxK
DO - 10.1007/978-3-030-34356-9_32
M3 - Conference contribution
AN - SCOPUS:85076865826
SN - 9783030343552
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 432
EP - 440
BT - High Performance Computing - ISC High Performance 2019 International Workshops, Revised Selected Papers
A2 - Weiland, Michèle
A2 - Juckeland, Guido
A2 - Alam, Sadaf
A2 - Jagode, Heike
PB - Springer
T2 - 34th International Conference on High Performance Computing, ISC High Performance 2019
Y2 - 16 June 2019 through 20 June 2019
ER -