Muses: Distributed data migration system for polystores

Abdulrahman Kaitoua; Tilmann Rabl; Asterios Katsifodimos; Volker Markl

doi:10.1109/ICDE.2019.00152

Web Information Systems

Research output: Chapter in Book/Conference proceedings/Edited volume › Conference contribution › Scientific › peer-review

Abstract

Large datasets can originate from various sources and are stored in heterogeneous formats, schemas, and locations. Typical data science tasks need to combine those datasets in order to increase their value and extract knowledge. This is done in various data processing systems with diverse execution engines. To take advantage of each execution engine's characteristics and APIs, data scientists need to migrate and transform their datasets, at a very high cost in computation and manual labor. Data migration is challenging for two main reasons: i) execution engines expect specific types/shapes of the data as input; ii) there are various physical representations of the data (e.g., partitions). Therefore, migrating data efficiently requires knowledge of the systems' internals and assumptions. In this paper we present Muses, a distributed, high-performance data migration engine that can efficiently forward, transform, repartition, and broadcast data between instances of distributed engines. Muses does not require any changes in the underlying execution engines. In an experimental evaluation, we show that migrating data from one execution engine to another (in order to take advantage of faster, native operations) can increase a pipeline's performance by 30%.

Original language	English
Title of host publication	2019 IEEE 35th International Conference on Data Engineering (ICDE)
Subtitle of host publication	Proceedings
Publisher	IEEE
Pages	1602-1605
Number of pages	4
ISBN (Electronic)	978-1-5386-7474-1
ISBN (Print)	978-1-5386-7475-8
DOIs	https://doi.org/10.1109/ICDE.2019.00152
Publication status	Published - 2019
Event	35th IEEE International Conference on Data Engineering, ICDE 2019 - Macau, China; Duration: 8 Apr 2019 → 11 Apr 2019

Conference

Conference	35th IEEE International Conference on Data Engineering, ICDE 2019
Country/Territory	China
City	Macau
Period	8/04/19 → 11/04/19

Keywords

Big data engine
Data integration
Data migration
Data transformation
Distributed systems

Access to Document

10.1109/ICDE.2019.00152

Cite this

@inproceedings{01734b84d6844086b3baddfded793898,

title = "Muses: Distributed data migration system for polystores",

abstract = "Large datasets can originate from various sources and are being stored in heterogeneous formats, schemas, and locations. Typical data science tasks need to combine those datasets in order to increase their value and extract knowledge. This is done in various data processing systems with diverse execution engines. In order to take advantage of each execution engine's characteristics and APIs data scientists need to migrate and transform their datasets at a very high computational cost and manual labor. Data migration is challenging for two main reasons: i) execution engines expect specific types/shapes of the data as input; ii) there are various physical representations of the data (e.g., partitions). Therefore, migrating data efficiently requires knowledge of systems internals and assumptions. In this paper we present Muses, a distributed, high-performance data migration engine that is able to forward, transform, repartition, and broadcast data between distributed engines' instances efficiently. Muses does not require any changes in the underlying execution engines. In an experimental evaluation, we show that migrating data from one execution engine to another (in order to take advantage of faster, native operations) can increase a pipeline's performance by 30%.",

keywords = "Big data engine, Data integration, Data migration, Data transformation, Distributed systems",

author = "Abdulrahman Kaitoua and Tilmann Rabl and Asterios Katsifodimos and Volker Markl",

year = "2019",

doi = "10.1109/ICDE.2019.00152",

language = "English",

isbn = "978-1-5386-7475-8",

pages = "1602--1605",

booktitle = "2019 IEEE 35th International Conference on Data Engineering (ICDE)",

publisher = "IEEE",

address = "United States",

note = "35th IEEE International Conference on Data Engineering, ICDE 2019 ; Conference date: 08-04-2019 Through 11-04-2019",

}

TY - GEN

T1 - Muses: Distributed data migration system for polystores

T2 - 35th IEEE International Conference on Data Engineering, ICDE 2019

AU - Kaitoua, Abdulrahman

AU - Rabl, Tilmann

AU - Katsifodimos, Asterios

AU - Markl, Volker

PY - 2019

Y1 - 2019

AB - Large datasets can originate from various sources and are being stored in heterogeneous formats, schemas, and locations. Typical data science tasks need to combine those datasets in order to increase their value and extract knowledge. This is done in various data processing systems with diverse execution engines. In order to take advantage of each execution engine's characteristics and APIs data scientists need to migrate and transform their datasets at a very high computational cost and manual labor. Data migration is challenging for two main reasons: i) execution engines expect specific types/shapes of the data as input; ii) there are various physical representations of the data (e.g., partitions). Therefore, migrating data efficiently requires knowledge of systems internals and assumptions. In this paper we present Muses, a distributed, high-performance data migration engine that is able to forward, transform, repartition, and broadcast data between distributed engines' instances efficiently. Muses does not require any changes in the underlying execution engines. In an experimental evaluation, we show that migrating data from one execution engine to another (in order to take advantage of faster, native operations) can increase a pipeline's performance by 30%.

KW - Big data engine

KW - Data integration

KW - Data migration

KW - Data transformation

KW - Distributed systems

UR - http://www.scopus.com/inward/record.url?scp=85068010357&partnerID=8YFLogxK

U2 - 10.1109/ICDE.2019.00152

DO - 10.1109/ICDE.2019.00152

M3 - Conference contribution

AN - SCOPUS:85068010357

SN - 978-1-5386-7475-8

SP - 1602

EP - 1605

BT - 2019 IEEE 35th International Conference on Data Engineering (ICDE)

PB - IEEE

Y2 - 8 April 2019 through 11 April 2019

ER -
