BlockJoin: Efficient Matrix Partitioning Through Joins

Andreas Kunft; Asterios Katsifodimos; Sebastian  Schelter; Tilmann Rabl; Volker Markl

BlockJoin: Efficient Matrix Partitioning Through Joins

Andreas Kunft, Asterios Katsifodimos, Sebastian Schelter, Tilmann Rabl, Volker Markl

Web Information Systems

Research output: Chapter in Book/Conference proceedings/Edited volume › Conference contribution › Scientific › peer-review

139 Downloads (Pure)

Abstract

Linear algebra operations are at the core of many Machine Learning (ML) programs. At the same time, a considerable amount of the effort for solving data analytics problems is spent in data preparation. As a result, end-to- end ML pipelines often consist of (i) relational operators used for joining the input data, (ii) user defined functions used for feature extraction and vectorization, and (iii) linear algebra operators used for model training and cross- validation. Often, these pipelines need to scale out to large datasets. In this case, these pipelines are usually implemented on top of dataflow engines like Hadoop, Spark, or Flink. These dataflow engines implement relational operators on row-partitioned datasets. However, efficient linear algebra operators use block-partitioned matrices. As a result, pipelines combining both kinds of operators require rather expensive changes to the physical representation, in particular re partitioning steps. In this paper, we investigate the potential of reducing shuffling costs by fusing relational and linear algebra operations into specialized physical operators. We present BlockJoin, a distributed join algorithm which directly produces block-partitioned results. To minimize shuffling costs, BlockJoin applies database techniques known from columnar processing, such as index-joins and late materialization, in the context of parallel dataflow engines. Our experimental evaluation shows speedups up to 6× and the skew resistance of BlockJoin compared to state- of-the-art pipelines implemented in Spark.

Original language	English
Title of host publication	Proceedings of the VLDB Endowment
Subtitle of host publication	Proceedings of the 43rd International Conference on Very Large Data Bases
Editors	Peter Boncz, Ken Salem
Publisher	VLDB Endowment
Pages	2061-2072
Number of pages	12
Publication status	Published - 2017
Event	VLDB 2017: 43rd International Conference on Very Large Data Bases - Munich, Germany Duration: 28 Aug 2017 → 1 Sept 2017 Conference number: 43 http://www.vldb.org/2017/

Publication series

Name	Proceedings of the VLDB Endowment
Number	13
Volume	10

Conference

Conference	VLDB 2017
Country/Territory	Germany
City	Munich
Period	28/08/17 → 1/09/17
Internet address	http://www.vldb.org/2017/

Access to Document

BlockJoin PVLDB-17Final published version, 1.2 MBLicence: CC BY-NC-ND

Cite this

@inproceedings{3eafcbe9a7fe400d9c59918707e10cd2,

title = "BlockJoin: Efficient Matrix Partitioning Through Joins",

abstract = "Linear algebra operations are at the core of many Machine Learning (ML) programs. At the same time, a considerable amount of the effort for solving data analytics problems is spent in data preparation. As a result, end-to- end ML pipelines often consist of (i) relational operators used for joining the input data, (ii) user defined functions used for feature extraction and vectorization, and (iii) linear algebra operators used for model training and cross- validation. Often, these pipelines need to scale out to large datasets. In this case, these pipelines are usually implemented on top of dataflow engines like Hadoop, Spark, or Flink. These dataflow engines implement relational operators on row-partitioned datasets. However, efficient linear algebra operators use block-partitioned matrices. As a result, pipelines combining both kinds of operators require rather expensive changes to the physical representation, in particular re partitioning steps. In this paper, we investigate the potential of reducing shuffling costs by fusing relational and linear algebra operations into specialized physical operators. We present BlockJoin, a distributed join algorithm which directly produces block-partitioned results. To minimize shuffling costs, BlockJoin applies database techniques known from columnar processing, such as index-joins and late materialization, in the context of parallel dataflow engines. Our experimental evaluation shows speedups up to 6× and the skew resistance of BlockJoin compared to state- of-the-art pipelines implemented in Spark.",

author = "Andreas Kunft and Asterios Katsifodimos and Sebastian Schelter and Tilmann Rabl and Volker Markl",

year = "2017",

language = "English",

series = "Proceedings of the VLDB Endowment",

publisher = "VLDB Endowment",

number = "13",

pages = "2061--2072",

editor = "Peter Boncz and Ken Salem",

booktitle = "Proceedings of the VLDB Endowment",

note = "VLDB 2017 : 43rd International Conference on Very Large Data Bases ; Conference date: 28-08-2017 Through 01-09-2017",

url = "http://www.vldb.org/2017/",

}

Kunft, A, Katsifodimos, A, Schelter, S, Rabl, T & Markl, V 2017, BlockJoin: Efficient Matrix Partitioning Through Joins. in P Boncz & K Salem (eds), Proceedings of the VLDB Endowment: Proceedings of the 43rd International Conference on Very Large Data Bases. Proceedings of the VLDB Endowment, no. 13, vol. 10, VLDB Endowment, pp. 2061-2072, VLDB 2017, Munich, Germany, 28/08/17.

BlockJoin: Efficient Matrix Partitioning Through Joins. / Kunft, Andreas; Katsifodimos, Asterios; Schelter, Sebastian et al.
Proceedings of the VLDB Endowment: Proceedings of the 43rd International Conference on Very Large Data Bases. ed. / Peter Boncz; Ken Salem. VLDB Endowment, 2017. p. 2061-2072 (Proceedings of the VLDB Endowment; Vol. 10, No. 13).

Research output: Chapter in Book/Conference proceedings/Edited volume › Conference contribution › Scientific › peer-review

TY - GEN

T1 - BlockJoin

T2 - VLDB 2017

AU - Kunft, Andreas

AU - Katsifodimos, Asterios

AU - Schelter, Sebastian

AU - Rabl, Tilmann

AU - Markl, Volker

N1 - Conference code: 43

PY - 2017

Y1 - 2017

N2 - Linear algebra operations are at the core of many Machine Learning (ML) programs. At the same time, a considerable amount of the effort for solving data analytics problems is spent in data preparation. As a result, end-to- end ML pipelines often consist of (i) relational operators used for joining the input data, (ii) user defined functions used for feature extraction and vectorization, and (iii) linear algebra operators used for model training and cross- validation. Often, these pipelines need to scale out to large datasets. In this case, these pipelines are usually implemented on top of dataflow engines like Hadoop, Spark, or Flink. These dataflow engines implement relational operators on row-partitioned datasets. However, efficient linear algebra operators use block-partitioned matrices. As a result, pipelines combining both kinds of operators require rather expensive changes to the physical representation, in particular re partitioning steps. In this paper, we investigate the potential of reducing shuffling costs by fusing relational and linear algebra operations into specialized physical operators. We present BlockJoin, a distributed join algorithm which directly produces block-partitioned results. To minimize shuffling costs, BlockJoin applies database techniques known from columnar processing, such as index-joins and late materialization, in the context of parallel dataflow engines. Our experimental evaluation shows speedups up to 6× and the skew resistance of BlockJoin compared to state- of-the-art pipelines implemented in Spark.

AB - Linear algebra operations are at the core of many Machine Learning (ML) programs. At the same time, a considerable amount of the effort for solving data analytics problems is spent in data preparation. As a result, end-to- end ML pipelines often consist of (i) relational operators used for joining the input data, (ii) user defined functions used for feature extraction and vectorization, and (iii) linear algebra operators used for model training and cross- validation. Often, these pipelines need to scale out to large datasets. In this case, these pipelines are usually implemented on top of dataflow engines like Hadoop, Spark, or Flink. These dataflow engines implement relational operators on row-partitioned datasets. However, efficient linear algebra operators use block-partitioned matrices. As a result, pipelines combining both kinds of operators require rather expensive changes to the physical representation, in particular re partitioning steps. In this paper, we investigate the potential of reducing shuffling costs by fusing relational and linear algebra operations into specialized physical operators. We present BlockJoin, a distributed join algorithm which directly produces block-partitioned results. To minimize shuffling costs, BlockJoin applies database techniques known from columnar processing, such as index-joins and late materialization, in the context of parallel dataflow engines. Our experimental evaluation shows speedups up to 6× and the skew resistance of BlockJoin compared to state- of-the-art pipelines implemented in Spark.

UR - http://resolver.tudelft.nl/uuid:3eafcbe9-a7fe-400d-9c59-918707e10cd2

M3 - Conference contribution

T3 - Proceedings of the VLDB Endowment

SP - 2061

EP - 2072

BT - Proceedings of the VLDB Endowment

A2 - Boncz, Peter

A2 - Salem, Ken

PB - VLDB Endowment

Y2 - 28 August 2017 through 1 September 2017

ER -

BlockJoin: Efficient Matrix Partitioning Through Joins

Abstract

Publication series

Conference

Access to Document

Other files and links

Fingerprint

Cite this