Searching metadata stored in self-describing file formats efficiently
Abstract
Self-describing file formats, such as HDF5 and netCDF, store metadata that describes the data alongside the data itself. These file formats are used by a large number of science applications to store enormous amounts of data. However, searching for a data object (i.e., a variable) using the metadata in these files is often performed by extracting the metadata from the data files, storing it in a database, and running SQL-like queries. This defeats the purpose of self-describing file formats, in which metadata is stored with the data, because any modification to the metadata in the files must also be propagated to the database used for searching. To search the metadata directly in self-describing files, we will present metadata search algorithms and tools that work directly with self-describing metadata in I/O libraries. We will present an evaluation of existing metadata indexing algorithms, an efficient hybrid indexing algorithm called MIQS (metadata indexing and querying for self-describing metadata), and an application of MIQS in HDF5, where we show several orders of magnitude performance improvement compared to querying with database management systems. We will also present a distributed hash tree-based indexing method applied in an object-centric data management system called Proactive Data Containers (PDC). These efficient indexing strategies offer a path towards autonomous data management systems that can hide the complexity of manually managing large numbers of files in datasets. We will discuss our strategies for querying for desired data objects without needing to know where the data is stored or how to move it between multiple file systems or layers of memory and storage between the data source and destination applications.
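To make the idea of querying metadata through an index concrete, the following is a minimal sketch of a plain in-memory inverted index over attribute key/value pairs. It is not the MIQS algorithm or the PDC indexing method, whose details are not given in this abstract; the object paths, attribute names, and the `build_index` and `query` helpers are hypothetical, and the metadata is supplied as a Python dictionary rather than read from an HDF5 or netCDF file.

```python
from collections import defaultdict


def build_index(objects):
    """Map each (attribute, value) pair to the set of object paths carrying it."""
    index = defaultdict(set)
    for path, attrs in objects.items():
        for key, value in attrs.items():
            index[(key, value)].add(path)
    return index


def query(index, **conditions):
    """Return the sorted paths whose metadata match every key=value condition."""
    matches = None
    for key, value in conditions.items():
        paths = index.get((key, value), set())
        # Intersect the candidate sets so that all conditions must hold.
        matches = set(paths) if matches is None else matches & paths
    return sorted(matches or [])


# Hypothetical metadata, standing in for attributes attached to data objects
# inside a self-describing file.
objects = {
    "/run1/temperature": {"units": "K", "instrument": "sensorA"},
    "/run1/pressure": {"units": "Pa", "instrument": "sensorA"},
    "/run2/temperature": {"units": "K", "instrument": "sensorB"},
}

index = build_index(objects)
print(query(index, units="K"))
print(query(index, units="K", instrument="sensorB"))
```

A lookup here touches only the index entries for the requested attribute values instead of scanning every object, which is the general motivation for indexing metadata in place rather than exporting it to a separate database.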
- Publication: AGU Fall Meeting Abstracts
- Pub Date: December 2020
- Bibcode: 2020AGUFMH159...03B
- Keywords: 1855 Remote sensing; HYDROLOGY; 1906 Computational models; algorithms; INFORMATICS; 1908 Cyberinfrastructure; INFORMATICS; 1916 Data and information discovery; INFORMATICS