Although transistor scaling continues to yield more transistors per chip, consistent performance gains from frequency scaling are no longer feasible due to physical limits. This trend has shifted the computational paradigm towards the integration of more and more processing cores. Multicore computing is challenging, not only because applications need to be parallelized, but also because memory access patterns and inter-core communication need to be carefully analyzed to achieve scalable performance gains.
Another trend in computing, especially in the big data era, is the use of heterogeneous cores. These heterogeneous architectures cannot be utilized efficiently in an architecture-agnostic way. Moreover, these systems normally have a deep memory hierarchy, which makes the assignment of data structures to the available memory spaces even more challenging. Developers need to understand the inherent memory access patterns of the application and match them to the facilities of the architecture to gain performance. Manual analysis of applications is tedious and error-prone. Therefore, tools are required to characterize the data communication in an application and highlight the communication hot spots.
In this thesis we present the design of MCProf, a runtime memory-access and data-communication profiler that helps programmers make communication-aware partitioning and mapping decisions based on a detailed quantitative profile of an application. MCProf provides detailed insight into the data flow in the application and highlights not only the compute-intensive parts but also the memory-intensive parts. Experimental results show that, on average, the proposed profiler has at least one order of magnitude less overhead than state-of-the-art data-communication profilers for a variety of benchmarks. Furthermore, the information provided is related back to the source code, making it easy for developers to use. We present a semi-automatic parallelization methodology based on MCProf to help programmers extract and express parallelism. We then present a framework that automates this process.
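As a rough illustration of what a data-communication profile records, the following minimal sketch tracks, for each memory byte, which function last wrote it, and attributes every later read by a different function to that producer. It is a toy model under stated assumptions and not MCProf's actual implementation: the class `CommunicationProfiler`, its methods, and the function names and addresses in the example are all hypothetical, and a real profiler would obtain the read and write events through binary instrumentation rather than explicit calls.

```python
from collections import defaultdict


class CommunicationProfiler:
    """Toy producer-consumer tracker (hypothetical; not MCProf's design)."""

    def __init__(self):
        # Shadow memory: byte address -> name of the function that last wrote it.
        self.shadow = {}
        # Communication matrix: (producer, consumer) -> number of bytes read.
        self.comm = defaultdict(int)

    def record_write(self, func, addr, size):
        # Mark every written byte as produced by `func`.
        for a in range(addr, addr + size):
            self.shadow[a] = func

    def record_read(self, func, addr, size):
        # Attribute each byte read to its last writer. Reads of data the
        # function wrote itself, or of never-written bytes, are ignored.
        for a in range(addr, addr + size):
            producer = self.shadow.get(a)
            if producer is not None and producer != func:
                self.comm[(producer, func)] += 1

    def hot_spots(self):
        # Producer-consumer pairs ordered by communicated bytes, largest first.
        return sorted(self.comm.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical access trace for three functions.
profiler = CommunicationProfiler()
profiler.record_write("init", 0x1000, 64)
profiler.record_write("filter", 0x2000, 16)
profiler.record_read("filter", 0x1000, 64)
profiler.record_read("reduce", 0x2000, 16)
profiler.record_read("reduce", 0x1000, 8)

for (producer, consumer), nbytes in profiler.hot_spots():
    print(f"{producer} -> {consumer}: {nbytes} bytes")
```

In this sketch the heaviest edge (`init` to `filter`) is the kind of communication hot spot a developer would want to keep within one partition, or map to memory that both sides can access cheaply.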
To validate the proposed tool, we present the acceleration of several applications as case studies targeting both homogeneous and heterogeneous multicore platforms. For homogeneous multicores, we demonstrate that the proposed parallelization methodology achieves up to 4× better performance than available commercial compilers. For heterogeneous systems using a GPU or an FPGA as an accelerator, experimental results show significant performance gains due to communication-aware application mapping. For instance, on the GPU, a speedup of up to 3× over the optimized parallel version was achieved by using the information generated by MCProf. For reconfigurable platforms, we present software- and hardware-based optimizations driven by the detailed application profile generated by MCProf. Software-based optimizations resulted in an overall speedup of 2.24×. Applying hardware-based optimizations for the FPGA resulted in a speedup of 1.83× compared to the baseline, along with a reduction of about 50% in energy consumption. Finally, we present a case study showing how the information generated by MCProf can be used to perform data-communication-aware evaluation of partitioning algorithms and partitioning solutions.
Original language: English
Qualification: Doctor of Philosophy
Award date: 28 Apr 2016
Print ISBNs: 978-94-6186-633-2
Publication status: Published - 28 Apr 2016

Research areas: data-communication profiling, heterogeneous computing, binary instrumentation, code parallelization, shadow memory

ID: 4311952