Data as a language: A novel approach to data integration

Christos Koutras*

*Corresponding author for this work

Research output: Contribution to conference; Abstract; Scientific

Abstract

In modern enterprises, both operational and organizational data is typically spread across multiple heterogeneous systems, databases and file systems. Recognizing the value of their data assets, companies and institutions construct data lakes, storing disparate datasets from different departments and systems. However, for those datasets to become useful, they need to be cleaned and integrated. Data can be well documented, structured and encoded in different schemata, but also unstructured with implicit, human-understandable semantics. Due to the sheer scale of the data itself but also the multitude of representations and schemata, data integration techniques need to scale without relying heavily on human labor. Existing integration approaches fail to address hidden semantics without human input or some form of ontology, making large-scale integration a daunting task. The goal of my doctoral work is to devise scalable data integration methods, employing modern machine learning to exploit semantics and facilitate discovery of novel relationship types. In order to capture semantics with minimal human intervention, we propose a new approach which we call Data as a Language (DaaL). By leveraging embeddings from the Natural Language Processing (NLP) literature, DaaL aims at extracting semantics from structured and semi-structured data, allowing the exploration of relevance and similarity among different data sources. This paper discusses existing data integration mechanisms and elaborates on how NLP techniques can be used in data integration, alongside challenges and research directions.

Original language: English
Number of pages: 4
Publication status: Published - 2019
Event: 2019 International Conference on Very Large Database PhD Workshop, VLDB-PhD 2019 - Los Angeles, United States
Duration: 26 Aug 2019 to 30 Aug 2019

Conference

Conference: 2019 International Conference on Very Large Database PhD Workshop, VLDB-PhD 2019
Country/Territory: United States
City: Los Angeles
Period: 26/08/19 to 30/08/19
