TY - GEN
T1 - Evaluating POWER Architecture for Distributed Training of Generative Adversarial Networks
AU - Hesam, Ahmad
AU - Vallecorsa, Sofia
AU - Khattak, Gulrukh
AU - Carminati, Federico
PY - 2019
Y1 - 2019
AB - The increased availability of High-Performance Computing resources enables data scientists to deploy and evaluate data-driven approaches, notably in the field of deep learning, at a rapid pace. As deep neural networks become more complex and ingest increasingly larger datasets, it becomes impractical to perform the training phase on single machine instances due to memory constraints and extremely long training times. Rather than scaling up, scaling out the computing resources is a productive approach to improving performance. The paradigm of data parallelism allows us to split the training dataset into manageable chunks that can be processed in parallel. In this work, we evaluate the scaling performance of training a 3D generative adversarial network (GAN) on an IBM POWER8 cluster equipped with 12 NVIDIA P100 GPUs. The full training duration of the GAN, including evaluation, is reduced from 20 h and 16 min on a single GPU to 2 h and 14 min on all 12 GPUs. We achieve a scaling efficiency of 98.9% when scaling from 1 to 12 GPUs, taking only the training process into consideration.
KW - Distributed training
KW - Generative adversarial network
KW - GPU
KW - High Performance Computing
KW - POWER8
UR - http://www.scopus.com/inward/record.url?scp=85076865826&partnerID=8YFLogxK
DO - 10.1007/978-3-030-34356-9_32
M3 - Conference contribution
AN - SCOPUS:85076865826
SN - 9783030343552
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 432
EP - 440
BT - High Performance Computing - ISC High Performance 2019 International Workshops, Revised Selected Papers
A2 - Weiland, Michèle
A2 - Juckeland, Guido
A2 - Alam, Sadaf
A2 - Jagode, Heike
PB - Springer
T2 - 34th International Conference on High Performance Computing, ISC High Performance 2019
Y2 - 16 June 2019 through 20 June 2019
ER -