• Alexander Alexandrov
• Lauritz Thamsen
• Andreas Kunft
• Odej Kao
• Asterios Katsifodimos
• Tobias Herb
• Felix Schüler
• Volker Markl

The appeal of MapReduce has spawned a family of systems that implement or extend it. In order to enable parallel collection processing with User-Defined Functions (UDFs), these systems expose extensions of the MapReduce programming model as library-based dataflow APIs that are tightly coupled to their underlying runtime engine. Expressing data analysis algorithms with complex data and control flow structure using such APIs reveals a number of limitations that impede programmer's productivity. In this paper we show that the design of data analysis languages and APIs from a runtime engine point of view bloats the APIs with low-level primitives and affects programmer's productivity. Instead, we argue that an approach based on deeply embedding the APIs in a host language can address the shortcomings of current data analysis languages. To demonstrate this, we propose a language for complex data analysis embedded in Scala, which (i) allows for declarative specification of dataflows and (ii) hides the notion of data-parallelism and distributed runtime behind a suitable intermediate representation. We describe a compiler pipeline that facilitates efficient data-parallel processing without imposing runtime engine-bound syntactic or semantic restrictions on the structure of the input programs. We present a series of experiments with two state-of-the-art systems that demonstrate the optimization potential of our approach.

Original language: English
Title of host publication: SIGMOD 2015 - Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
Publisher: Association for Computing Machinery (ACM)
Number of pages: 15
ISBN (Electronic): 9781450327589
Publication status: Published - 27 May 2015
Event: ACM SIGMOD International Conference on Management of Data, SIGMOD 2015 - Melbourne, Australia
Duration: 31 May 2015 – 4 Jun 2015


Conference: ACM SIGMOD International Conference on Management of Data, SIGMOD 2015

ID: 36129191