Communication Driven Mapping of Applications on Multicore Platforms

Imran Ashraf

doi:10.4233/uuid:uuid:7fbff4c6-7803-435e-a441-8e3483b8a21e

Computer Engineering

Research output: Thesis › Dissertation (TU Delft)

Abstract

Although transistor scaling continues to yield more transistors per chip, consistent performance gains from frequency scaling are no longer feasible because of physical limits. These trends have shifted the computational paradigm towards the integration of more and more processing cores. Multicore computing is challenging, not only because applications need to be parallelized, but also because memory access patterns and inter-core communication need to be carefully analyzed to achieve scalable performance gains.
Another trend in computing is the use of heterogeneous cores, especially in the big data era. These heterogeneous architectures cannot be used efficiently in an architecture-agnostic way. Moreover, such systems normally have a deep memory hierarchy, which makes assigning data structures to the available memory spaces even more challenging. Developers need to understand the inherent memory access patterns of the application and match them to the facilities of the architecture to gain performance. Manual analysis of applications is tedious and error-prone. Tools are therefore required to characterize the data communication in an application and highlight the communication hot spots.
In this thesis we present the design of MCProf, a runtime memory-access and data-communication profiler that helps programmers make communication-aware partitioning and mapping decisions based on a detailed quantitative profile of an application. MCProf provides detailed insight into the data flow in the application and highlights not only the compute-intensive parts but also the memory-intensive parts. Experimental results show that, on average, the proposed profiler has at least one order of magnitude less overhead than state-of-the-art data-communication profilers for a variety of benchmarks. Furthermore, the information provided is related to the source code, making it easy for developers to use. We present a semi-automatic parallelization methodology based on MCProf to help programmers extract and express parallelism. We then present a framework that automates this process.
To validate the proposed tool, we present the acceleration of several applications as case studies targeting both homogeneous and heterogeneous multicore platforms. For homogeneous multicores, we demonstrate that the proposed parallelization methodology can achieve up to 4× better performance than available commercial compilers. For heterogeneous systems using GPUs and FPGAs as accelerators, experimental results show significant performance gains due to communication-aware application mapping. For instance, on the GPU, a speedup of up to 3× over the optimized parallel version was achieved by using the information generated by MCProf. For reconfigurable platforms, we present software- and hardware-based optimizations based on the detailed application profile generated by MCProf. Software-based optimizations resulted in an overall speedup of 2.24×. Applying hardware-based optimizations for the FPGA resulted in a speedup of 1.83× compared to the baseline, along with a reduction of about 50% in energy consumption. Finally, we present a case study showing how the information generated by MCProf can be used to perform a data-communication-aware evaluation of partitioning algorithms and partitioning solutions.
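
To illustrate the kind of data-communication profile described in the abstract, the following is a minimal Python sketch; it is not MCProf's actual implementation. It assumes a hypothetical trace of (function, operation, address) records and uses a dictionary as a stand-in for shadow memory, remembering the last writer of each address so that each read can be attributed to a producer-consumer pair of functions. The function name `profile_communication` and the trace format are assumptions made purely for illustration.

```python
from collections import defaultdict


def profile_communication(trace):
    """Count bytes communicated between functions.

    trace: iterable of (function_name, op, address) tuples, where op is
    "W" (write) or "R" (read) and each access touches one byte.
    This trace format is hypothetical, chosen only for illustration.
    """
    shadow = {}               # address -> last writer (stand-in for shadow memory)
    comm = defaultdict(int)   # (producer, consumer) -> bytes read
    for func, op, addr in trace:
        if op == "W":
            # Record this function as the producer of the byte at addr.
            shadow[addr] = func
        elif op == "R":
            producer = shadow.get(addr)
            # Count only reads of data produced by a different function.
            if producer is not None and producer != func:
                comm[(producer, func)] += 1
        else:
            raise ValueError(f"unknown op {op!r}")
    return dict(comm)


# A tiny made-up trace: init produces two bytes, filter consumes them
# and produces one byte, output consumes from both.
trace = [
    ("init", "W", 0x10),
    ("init", "W", 0x11),
    ("filter", "R", 0x10),
    ("filter", "R", 0x11),
    ("filter", "W", 0x20),
    ("output", "R", 0x20),
    ("output", "R", 0x10),
]
result = profile_communication(trace)
for (producer, consumer), nbytes in sorted(result.items()):
    print(f"{producer} -> {consumer}: {nbytes} byte(s)")
```

In this sketch, pairs with large byte counts would correspond to the communication hot spots mentioned above; a real profiler would obtain the trace through binary instrumentation rather than from a hand-written list.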

Original language	English
Qualification	Doctor of Philosophy
Awarding Institution	Delft University of Technology
Supervisors/Advisors	Bertels, Koen, Supervisor
Award date	28 Apr 2016
Print ISBNs	978-94-6186-633-2
DOIs	https://doi.org/10.4233/uuid:uuid:7fbff4c6-7803-435e-a441-8e3483b8a21e
Publication status	Published - 2016

Keywords

data-communication profiling
heterogeneous computing
binary instrumentation
code parallelization
shadow memory

Cite this

@phdthesis{7fbff4c67803435ea4418e3483b8a21e,
title = "Communication Driven Mapping of Applications on Multicore Platforms",
keywords = "data-communication profiling, heterogeneous computing, binary instrumentation, code parallelization, shadow memory",
author = "Imran Ashraf",
year = "2016",
doi = "10.4233/uuid:uuid:7fbff4c6-7803-435e-a441-8e3483b8a21e",
language = "English",
isbn = "978-94-6186-633-2",
type = "Dissertation (TU Delft)",
school = "Delft University of Technology",
}

TY - THES
T1 - Communication Driven Mapping of Applications on Multicore Platforms
AU - Ashraf, Imran
PY - 2016
Y1 - 2016
KW - data-communication profiling
KW - heterogeneous computing
KW - binary instrumentation
KW - code parallelization
KW - shadow memory
UR - http://resolver.tudelft.nl/uuid:7fbff4c6-7803-435e-a441-8e3483b8a21e
U2 - 10.4233/uuid:uuid:7fbff4c6-7803-435e-a441-8e3483b8a21e
DO - 10.4233/uuid:uuid:7fbff4c6-7803-435e-a441-8e3483b8a21e
M3 - Dissertation (TU Delft)
SN - 978-94-6186-633-2
ER -
