Standard

Improved Dynamic Cache Sharing for Communicating Threads on a Runtime-Adaptable Processor. / Hoozemans, Joost; Lorenzon, Arthur; Schneider Beck, Antonio Carlos; Wong, Stephan.

2017. pp. 1-9. Abstract from Workshop Reconfigurable Computing 2017, Stockholm, Sweden.

Research output: Contribution to conference › Abstract › Scientific

Harvard

Hoozemans, J, Lorenzon, A, Schneider Beck, AC & Wong, S 2017, 'Improved Dynamic Cache Sharing for Communicating Threads on a Runtime-Adaptable Processor', Workshop Reconfigurable Computing 2017, Stockholm, Sweden, 23/01/17 - 23/01/17, pp. 1-9.

BibTeX

@conference{09d29f012ea1451d90e0ebd751ee499f,
title = "Improved Dynamic Cache Sharing for Communicating Threads on a Runtime-Adaptable Processor",
abstract = "Multi-threaded applications execute their threads on different cores with their own local caches and need to share data among the threads. Shared caches are used to avoid lengthy and costly main memory accesses. The degree of cache sharing is a balance between reducing misses and increased hit latency. Dynamic caches have been proposed to adapt this balance to the workload type. Similarly, dynamic processors aim to execute workloads as efficiently as possible by balancing between exploiting Instruction-Level Parallelism (ILP) and Thread-Level Parallelism (TLP). To support this, they consist of multiple processing components and caches that have adaptable interconnects between them. Depending on the workload characteristics, these can connect them together to form a large core that exploits ILP, or split them up to form multiple cores that can run multiple threads (exploiting TLP). In this paper, we propose a cache system that further exploits this additional connectivity of a dynamic VLIW processor by being able to forward cache accesses to multiple cache blocks while the processor is running in multi-threaded (‘split’) mode. Additionally, only requests to global data are broadcasted, while accesses to local data are kept private. This improves hit rates similarly to existing cache-sharing schemes, but reduces the penalty due to stalling the other subcores. Local accesses are recognized by distinguishing memory accesses relative to the stack frame pointer. Results show that our cache exhibits similar miss rate reductions as shared caches (up to 90{\%} and on average 26{\%}), and reduces the number of broadcasted accesses by 21{\%}.",
author = "Joost Hoozemans and Arthur Lorenzon and {Schneider Beck}, {Antonio Carlos} and Stephan Wong",
year = "2017",
month = "1",
language = "English",
pages = "1--9",
note = "Workshop Reconfigurable Computing 2017: 11th HiPEAC Workshop, WRC; Conference date: 23-01-2017 Through 23-01-2017",
url = "https://www.hipeac.net/events/activities/7441/wrc/#fndtn-main",
}

RIS

TY - CONF

T1 - Improved Dynamic Cache Sharing for Communicating Threads on a Runtime-Adaptable Processor

AU - Hoozemans, Joost

AU - Lorenzon, Arthur

AU - Schneider Beck, Antonio Carlos

AU - Wong, Stephan

PY - 2017/1

Y1 - 2017/1

N2 - Multi-threaded applications execute their threads on different cores with their own local caches and need to share data among the threads. Shared caches are used to avoid lengthy and costly main memory accesses. The degree of cache sharing is a balance between reducing misses and increased hit latency. Dynamic caches have been proposed to adapt this balance to the workload type. Similarly, dynamic processors aim to execute workloads as efficiently as possible by balancing between exploiting Instruction-Level Parallelism (ILP) and Thread-Level Parallelism (TLP). To support this, they consist of multiple processing components and caches that have adaptable interconnects between them. Depending on the workload characteristics, these can connect them together to form a large core that exploits ILP, or split them up to form multiple cores that can run multiple threads (exploiting TLP). In this paper, we propose a cache system that further exploits this additional connectivity of a dynamic VLIW processor by being able to forward cache accesses to multiple cache blocks while the processor is running in multi-threaded (‘split’) mode. Additionally, only requests to global data are broadcasted, while accesses to local data are kept private. This improves hit rates similarly to existing cache-sharing schemes, but reduces the penalty due to stalling the other subcores. Local accesses are recognized by distinguishing memory accesses relative to the stack frame pointer. Results show that our cache exhibits similar miss rate reductions as shared caches (up to 90% and on average 26%), and reduces the number of broadcasted accesses by 21%.

AB - Multi-threaded applications execute their threads on different cores with their own local caches and need to share data among the threads. Shared caches are used to avoid lengthy and costly main memory accesses. The degree of cache sharing is a balance between reducing misses and increased hit latency. Dynamic caches have been proposed to adapt this balance to the workload type. Similarly, dynamic processors aim to execute workloads as efficiently as possible by balancing between exploiting Instruction-Level Parallelism (ILP) and Thread-Level Parallelism (TLP). To support this, they consist of multiple processing components and caches that have adaptable interconnects between them. Depending on the workload characteristics, these can connect them together to form a large core that exploits ILP, or split them up to form multiple cores that can run multiple threads (exploiting TLP). In this paper, we propose a cache system that further exploits this additional connectivity of a dynamic VLIW processor by being able to forward cache accesses to multiple cache blocks while the processor is running in multi-threaded (‘split’) mode. Additionally, only requests to global data are broadcasted, while accesses to local data are kept private. This improves hit rates similarly to existing cache-sharing schemes, but reduces the penalty due to stalling the other subcores. Local accesses are recognized by distinguishing memory accesses relative to the stack frame pointer. Results show that our cache exhibits similar miss rate reductions as shared caches (up to 90% and on average 26%), and reduces the number of broadcasted accesses by 21%.

M3 - Abstract

SP - 1

EP - 9

ER -

ID: 30883476