Standard

Improved Dynamic Cache Sharing for Communicating Threads on a Runtime-Adaptable Processor. / Hoozemans, Joost; Lorenzon, Arthur; Schneider Beck, Antonio Carlos; Wong, Stephan.

2017. pp. 1-9. Abstract from Workshop Reconfigurable Computing 2017, Stockholm, Sweden.

Research output: Contribution to conference › Abstract › Scientific

Harvard

Hoozemans, J, Lorenzon, A, Schneider Beck, AC & Wong, S 2017, 'Improved Dynamic Cache Sharing for Communicating Threads on a Runtime-Adaptable Processor', Workshop Reconfigurable Computing 2017, Stockholm, Sweden, 23/01/17 - 23/01/17, pp. 1-9.

BibTeX

@conference{09d29f012ea1451d90e0ebd751ee499f,
title = "Improved Dynamic Cache Sharing for Communicating Threads on a Runtime-Adaptable Processor",
abstract = "Multi-threaded applications execute their threads on different cores with their own local caches and need to share data among the threads. Shared caches are used to avoid lengthy and costly main memory accesses. The degree of cache sharing is a balance between reducing misses and increased hit latency. Dynamic caches have been proposed to adapt this balance to the workload type. Similarly, dynamic processors aim to execute workloads as efficiently as possible by balancing between exploiting Instruction-Level Parallelism (ILP) and Thread-Level Parallelism (TLP). To support this, they consist of multiple processing components and caches that have adaptable interconnects between them. Depending on the workload characteristics, these can connect them together to form a large core that exploits ILP, or split them up to form multiple cores that can run multiple threads (exploiting TLP). In this paper, we propose a cache system that further exploits this additional connectivity of a dynamic VLIW processor by being able to forward cache accesses to multiple cache blocks while the processor is running in multi-threaded (‘split’) mode. Additionally, only requests to global data are broadcasted, while accesses to local data are kept private. This improves hit rates similarly to existing cache-sharing schemes, but reduces the penalty due to stalling the other subcores. Local accesses are recognized by distinguishing memory accesses relative to the stack frame pointer. Results show that our cache exhibits similar miss rate reductions as shared caches (up to 90{\%} and on average 26{\%}), and reduces the number of broadcasted accesses by 21{\%}.",
author = "Joost Hoozemans and Arthur Lorenzon and {Schneider Beck}, {Antonio Carlos} and Stephan Wong",
year = "2017",
month = "1",
language = "English",
pages = "1--9",
note = "Workshop Reconfigurable Computing 2017: 11th HiPEAC Workshop, WRC; Conference date: 23-01-2017 Through 23-01-2017",
url = "https://www.hipeac.net/events/activities/7441/wrc/#fndtn-main",
}

RIS

TY - CONF

T1 - Improved Dynamic Cache Sharing for Communicating Threads on a Runtime-Adaptable Processor

AU - Hoozemans, Joost

AU - Lorenzon, Arthur

AU - Schneider Beck, Antonio Carlos

AU - Wong, Stephan

PY - 2017/1

Y1 - 2017/1

N2 - Multi-threaded applications execute their threads on different cores with their own local caches and need to share data among the threads. Shared caches are used to avoid lengthy and costly main memory accesses. The degree of cache sharing is a balance between reducing misses and increased hit latency. Dynamic caches have been proposed to adapt this balance to the workload type. Similarly, dynamic processors aim to execute workloads as efficiently as possible by balancing between exploiting Instruction-Level Parallelism (ILP) and Thread-Level Parallelism (TLP). To support this, they consist of multiple processing components and caches that have adaptable interconnects between them. Depending on the workload characteristics, these can connect them together to form a large core that exploits ILP, or split them up to form multiple cores that can run multiple threads (exploiting TLP). In this paper, we propose a cache system that further exploits this additional connectivity of a dynamic VLIW processor by being able to forward cache accesses to multiple cache blocks while the processor is running in multi-threaded (‘split’) mode. Additionally, only requests to global data are broadcasted, while accesses to local data are kept private. This improves hit rates similarly to existing cache-sharing schemes, but reduces the penalty due to stalling the other subcores. Local accesses are recognized by distinguishing memory accesses relative to the stack frame pointer. Results show that our cache exhibits similar miss rate reductions as shared caches (up to 90% and on average 26%), and reduces the number of broadcasted accesses by 21%.

AB - Multi-threaded applications execute their threads on different cores with their own local caches and need to share data among the threads. Shared caches are used to avoid lengthy and costly main memory accesses. The degree of cache sharing is a balance between reducing misses and increased hit latency. Dynamic caches have been proposed to adapt this balance to the workload type. Similarly, dynamic processors aim to execute workloads as efficiently as possible by balancing between exploiting Instruction-Level Parallelism (ILP) and Thread-Level Parallelism (TLP). To support this, they consist of multiple processing components and caches that have adaptable interconnects between them. Depending on the workload characteristics, these can connect them together to form a large core that exploits ILP, or split them up to form multiple cores that can run multiple threads (exploiting TLP). In this paper, we propose a cache system that further exploits this additional connectivity of a dynamic VLIW processor by being able to forward cache accesses to multiple cache blocks while the processor is running in multi-threaded (‘split’) mode. Additionally, only requests to global data are broadcasted, while accesses to local data are kept private. This improves hit rates similarly to existing cache-sharing schemes, but reduces the penalty due to stalling the other subcores. Local accesses are recognized by distinguishing memory accesses relative to the stack frame pointer. Results show that our cache exhibits similar miss rate reductions as shared caches (up to 90% and on average 26%), and reduces the number of broadcasted accesses by 21%.

M3 - Abstract

SP - 1

EP - 9

ER -

ID: 30883476