Research output: Scientific - peer-review › Article

**BlockJoin: Efficient Matrix Partitioning Through Joins.** / Kunft, Andreas; Katsifodimos, Asterios; Schelter, Sebastian; Rabl, Tilmann; Markl, Volker.

Kunft, A, Katsifodimos, A, Schelter, S, Rabl, T & Markl, V 2017, 'BlockJoin: Efficient Matrix Partitioning Through Joins', *Proceedings of the VLDB Endowment*, vol. 10, no. 11, pp. 2061.

Kunft, A., Katsifodimos, A., Schelter, S., Rabl, T., & Markl, V. (2017). BlockJoin: Efficient Matrix Partitioning Through Joins. *Proceedings of the VLDB Endowment*, *10*(11), 2061.

Kunft A, Katsifodimos A, Schelter S, Rabl T, Markl V. BlockJoin: Efficient Matrix Partitioning Through Joins. Proceedings of the VLDB Endowment. 2017 Sep;10(11):2061.

@article{3eafcbe9a7fe400d9c59918707e10cd2,

title = "BlockJoin: Efficient Matrix Partitioning Through Joins",

author = "Andreas Kunft and Asterios Katsifodimos and Sebastian Schelter and Tilmann Rabl and Volker Markl",

year = "2017",

month = "9",

volume = "10",

pages = "2061",

journal = "Proceedings of the VLDB Endowment",

issn = "2150-8097",

publisher = "VLDB Endowment",

number = "11",

}

TY - JOUR

T1 - BlockJoin: Efficient Matrix Partitioning Through Joins

AU - Kunft,Andreas

AU - Katsifodimos,Asterios

AU - Schelter,Sebastian

AU - Rabl,Tilmann

AU - Markl,Volker

PY - 2017/9

Y1 - 2017/9

N2 - Linear algebra operations are at the core of many Machine Learning (ML) programs. At the same time, a considerable amount of the effort for solving data analytics problems is spent in data preparation. As a result, end-to-end ML pipelines often consist of (i) relational operators used for joining the input data, (ii) user defined functions used for feature extraction and vectorization, and (iii) linear algebra operators used for model training and cross-validation. Often, these pipelines need to scale out to large datasets. In this case, these pipelines are usually implemented on top of dataflow engines like Hadoop, Spark, or Flink. These dataflow engines implement relational operators on row-partitioned datasets. However, efficient linear algebra operators use block-partitioned matrices. As a result, pipelines combining both kinds of operators require rather expensive changes to the physical representation, in particular re-partitioning steps. In this paper, we investigate the potential of reducing shuffling costs by fusing relational and linear algebra operations into specialized physical operators. We present BlockJoin, a distributed join algorithm which directly produces block-partitioned results. To minimize shuffling costs, BlockJoin applies database techniques known from columnar processing, such as index-joins and late materialization, in the context of parallel dataflow engines. Our experimental evaluation shows speedups up to 6× and the skew resistance of BlockJoin compared to state-of-the-art pipelines implemented in Spark.

AB - Linear algebra operations are at the core of many Machine Learning (ML) programs. At the same time, a considerable amount of the effort for solving data analytics problems is spent in data preparation. As a result, end-to-end ML pipelines often consist of (i) relational operators used for joining the input data, (ii) user defined functions used for feature extraction and vectorization, and (iii) linear algebra operators used for model training and cross-validation. Often, these pipelines need to scale out to large datasets. In this case, these pipelines are usually implemented on top of dataflow engines like Hadoop, Spark, or Flink. These dataflow engines implement relational operators on row-partitioned datasets. However, efficient linear algebra operators use block-partitioned matrices. As a result, pipelines combining both kinds of operators require rather expensive changes to the physical representation, in particular re-partitioning steps. In this paper, we investigate the potential of reducing shuffling costs by fusing relational and linear algebra operations into specialized physical operators. We present BlockJoin, a distributed join algorithm which directly produces block-partitioned results. To minimize shuffling costs, BlockJoin applies database techniques known from columnar processing, such as index-joins and late materialization, in the context of parallel dataflow engines. Our experimental evaluation shows speedups up to 6× and the skew resistance of BlockJoin compared to state-of-the-art pipelines implemented in Spark.

M3 - Article

VL - 10

SP - 2061

JO - Proceedings of the VLDB Endowment

T2 - Proceedings of the VLDB Endowment

JF - Proceedings of the VLDB Endowment

SN - 2150-8097

IS - 11

ER -

ID: 36128798