A Coflow-based Co-optimization Framework for High-performance Data Analytics

Long Cheng; Ying Wang; Yulong Pei; Dick Epema

doi:10.1109/ICPP.2017.48

Data-Intensive Systems

Research output: Chapter in Book/Conference proceedings/Edited volume › Conference contribution › Scientific › peer-review

Abstract

Efficient execution of distributed database operators such as joining and aggregating is critical for the performance of big data analytics. With the increase of the compute speedup of modern CPUs, reducing the network communication time of these operators in large systems is becoming increasingly important, and also challenging current techniques. Significant performance improvements have been achieved by using state-of-the-art methods, such as reducing network traffic designed in the data management domain, and data flow scheduling in the data communications domain. However, the proposed techniques in both fields just view each other as a black box, and performance gains from a co-optimization perspective have not yet been explored. In this paper, based on current research in coflow scheduling, we propose a novel Coflow-based Co-optimization Framework (CCF), which can co-optimize application-level data movement and network-level data communications for distributed operators, and consequently contribute to their performance in large distributed environments. We present the detailed design and implementation of CCF, and conduct an experimental evaluation of CCF using large-scale simulations on large data joins. Our results demonstrate that CCF can always perform faster than current approaches on network communications in large-scale distributed scenarios.

Original language	English
Title of host publication	Proceedings - 46th International Conference on Parallel Processing, ICPP 2017
Place of Publication	Los Alamitos, CA
Publisher	IEEE
Pages	392-401
Number of pages	10
ISBN (Electronic)	978-1-5386-1042-8
DOIs	https://doi.org/10.1109/ICPP.2017.48
Publication status	Published - 2017
Event	ICPP 2017: 46th International Conference on Parallel Processing - Bristol, United Kingdom; Duration: 14 Aug 2017 → 17 Aug 2017; Conference number: 46; http://www.icpp-conf.org/2017/index.php

Conference

Conference	ICPP 2017
Country/Territory	United Kingdom
City	Bristol
Period	14/08/17 → 17/08/17
Internet address	http://www.icpp-conf.org/2017/index.php

Keywords

big data
coflow scheduling
distributed joins
network communications
data-intensive applications

Access to Document

10.1109/ICPP.2017.48

ICPP_Cheng_Epema_2017: Accepted author manuscript, 263 KB. Licence: CC BY-NC-ND

Cite this

@inproceedings{4ffef8f85ca34a47a321933a23ee0282,

title = "A Coflow-based Co-optimization Framework for High-performance Data Analytics",

abstract = "Efficient execution of distributed database operators such as joining and aggregating is critical for the performance of big data analytics. With the increase of the compute speedup of modern CPUs, reducing the network communication time of these operators in large systems is becoming increasingly important, and also challenging current techniques. Significant performance improvements have been achieved by using state-of-the-art methods, such as reducing network traffic designed in the data management domain, and data flow scheduling in the data communications domain. However, the proposed techniques in both fields just view each other as a black box, and performance gains from a co-optimization perspective have not yet been explored. In this paper, based on current research in coflow scheduling, we propose a novel Coflow-based Co-optimization Framework (CCF), which can co-optimize application-level data movement and network-level data communications for distributed operators, and consequently contribute to their performance in large distributed environments. We present the detailed design and implementation of CCF, and conduct an experimental evaluation of CCF using large-scale simulations on large data joins. Our results demonstrate that CCF can always perform faster than current approaches on network communications in large-scale distributed scenarios.",

keywords = "big data, coflow scheduling, distributed joins, network communications, data-intensive applications",

author = "Long Cheng and Ying Wang and Yulong Pei and Dick Epema",

year = "2017",

doi = "10.1109/ICPP.2017.48",

language = "English",

pages = "392--401",

booktitle = "Proceedings - 46th International Conference on Parallel Processing, ICPP 2017",

publisher = "IEEE",

address = "United States",

note = "ICPP 2017 : 46th International Conference on Parallel Processing ; Conference date: 14-08-2017 Through 17-08-2017",

url = "http://www.icpp-conf.org/2017/index.php",

}

A Coflow-based Co-optimization Framework for High-performance Data Analytics. / Cheng, Long; Wang, Ying; Pei, Yulong et al.
Proceedings - 46th International Conference on Parallel Processing, ICPP 2017. Los Alamitos, CA: IEEE, 2017. p. 392-401.

TY - GEN

T1 - A Coflow-based Co-optimization Framework for High-performance Data Analytics

AU - Cheng, Long

AU - Wang, Ying

AU - Pei, Yulong

AU - Epema, Dick

N1 - Conference code: 46

PY - 2017

Y1 - 2017

N2 - Efficient execution of distributed database operators such as joining and aggregating is critical for the performance of big data analytics. With the increase of the compute speedup of modern CPUs, reducing the network communication time of these operators in large systems is becoming increasingly important, and also challenging current techniques. Significant performance improvements have been achieved by using state-of-the-art methods, such as reducing network traffic designed in the data management domain, and data flow scheduling in the data communications domain. However, the proposed techniques in both fields just view each other as a black box, and performance gains from a co-optimization perspective have not yet been explored. In this paper, based on current research in coflow scheduling, we propose a novel Coflow-based Co-optimization Framework (CCF), which can co-optimize application-level data movement and network-level data communications for distributed operators, and consequently contribute to their performance in large distributed environments. We present the detailed design and implementation of CCF, and conduct an experimental evaluation of CCF using large-scale simulations on large data joins. Our results demonstrate that CCF can always perform faster than current approaches on network communications in large-scale distributed scenarios.

AB - Efficient execution of distributed database operators such as joining and aggregating is critical for the performance of big data analytics. With the increase of the compute speedup of modern CPUs, reducing the network communication time of these operators in large systems is becoming increasingly important, and also challenging current techniques. Significant performance improvements have been achieved by using state-of-the-art methods, such as reducing network traffic designed in the data management domain, and data flow scheduling in the data communications domain. However, the proposed techniques in both fields just view each other as a black box, and performance gains from a co-optimization perspective have not yet been explored. In this paper, based on current research in coflow scheduling, we propose a novel Coflow-based Co-optimization Framework (CCF), which can co-optimize application-level data movement and network-level data communications for distributed operators, and consequently contribute to their performance in large distributed environments. We present the detailed design and implementation of CCF, and conduct an experimental evaluation of CCF using large-scale simulations on large data joins. Our results demonstrate that CCF can always perform faster than current approaches on network communications in large-scale distributed scenarios.

KW - big data

KW - coflow scheduling

KW - distributed joins

KW - network communications

KW - data-intensive applications

UR - http://resolver.tudelft.nl/uuid:4ffef8f8-5ca3-4a47-a321-933a23ee0282

U2 - 10.1109/ICPP.2017.48

DO - 10.1109/ICPP.2017.48

M3 - Conference contribution

SP - 392

EP - 401

BT - Proceedings - 46th International Conference on Parallel Processing, ICPP 2017

PB - IEEE

CY - Los Alamitos, CA

T2 - ICPP 2017

Y2 - 14 August 2017 through 17 August 2017

ER -
