TY - JOUR
T1 - An Intermediate Representation for Optimizing Machine Learning Pipelines
AU - Kunft, Andreas
AU - Katsifodimos, Asterios
AU - Schelter, Sebastian
AU - Bress, Sebastian
AU - Rabl, Tilmann
AU - Markl, Volker
PY - 2019/7
Y1 - 2019/7
N2 - Machine learning (ML) pipelines for model training and validation typically include preprocessing, such as data cleaning and feature engineering, prior to training an ML model. Preprocessing combines relational algebra and user-defined functions (UDFs), while model training uses iterations and linear algebra. Current systems are tailored to either of the two. As a consequence, preprocessing and ML steps are optimized in isolation. To enable holistic optimization of ML training pipelines, we present Lara, a declarative domain-specific language for collections and matrices. Lara's intermediate representation (IR) reflects on the complete program, i.e., UDFs, control flow, and both data types. Two views on the IR enable diverse optimizations. Monads enable operator pushdown and fusion across type and loop boundaries. Combinators provide the semantics of domain-specific operators and optimize data access and cross-validation of ML algorithms. Our experiments on preprocessing pipelines and selected ML algorithms show the effects of our proposed optimizations on dense and sparse data, which achieve speedups of up to an order of magnitude.
AB - Machine learning (ML) pipelines for model training and validation typically include preprocessing, such as data cleaning and feature engineering, prior to training an ML model. Preprocessing combines relational algebra and user-defined functions (UDFs), while model training uses iterations and linear algebra. Current systems are tailored to either of the two. As a consequence, preprocessing and ML steps are optimized in isolation. To enable holistic optimization of ML training pipelines, we present Lara, a declarative domain-specific language for collections and matrices. Lara's intermediate representation (IR) reflects on the complete program, i.e., UDFs, control flow, and both data types. Two views on the IR enable diverse optimizations. Monads enable operator pushdown and fusion across type and loop boundaries. Combinators provide the semantics of domain-specific operators and optimize data access and cross-validation of ML algorithms. Our experiments on preprocessing pipelines and selected ML algorithms show the effects of our proposed optimizations on dense and sparse data, which achieve speedups of up to an order of magnitude.
UR - http://www.scopus.com/inward/record.url?scp=85082818643&partnerID=8YFLogxK
U2 - 10.14778/3342263.3342633
DO - 10.14778/3342263.3342633
M3 - Conference article
SN - 2150-8097
VL - 12
SP - 1553
EP - 1567
JO - Proceedings of the VLDB Endowment
JF - Proceedings of the VLDB Endowment
IS - 11
ER -