DOI

Management of large hydrologic datasets including storage, structuring, clustering, indexing, and query is one of the crucial challenges in the era of big data. This research originates from a specific problem: time series extraction at specific locations takes a long time when a large multidimensional (MD) dataset is stored in the NetCDF classic or the 64-bit offset format. The essence of this issue lies in the contiguous storage structure adopted by NetCDF. In this research, NetCDF file-based solutions and a MD array database management system applying a chunked storage structure are benchmarked to determine the best solution for storing and querying large MD hydrologic datasets. Expert consultancy was conducted to establish benchmark sets, with the HydroNET-4 system being utilized to provide the benchmark environment. In the final benchmark tests, the effect of data storage configurations, elaborating chunk size, dimension order (spatio-temporal clustering) and compression on the query performance, is explored. Results indicate that for big hydrologic MD data management, the properly chunked NetCDF-4 solution without compression is, in general, more efficient than the SciDB DBMS. However, benefits of a DBMS should not be neglected, for example, the integration with other data types, smart caching strategies, transaction support, scalability, and out-of-The-box support for parallelization.

Original languageEnglish
Pages (from-to)1058-1070
Number of pages13
JournalJournal of Hydroinformatics
Volume20
Issue number5
DOIs
Publication statusPublished - 2018

    Research areas

  • Chunked storage structure, Hydrologic dataset, NetCDF, SciDB

ID: 51427409