Data infrastructure for dynamic, modular, provenance-focused data processing for streaming sensor networks
Abstract
Streaming sensor networks such as the National Ecological Observatory Network (NEON) deploy hundreds to thousands of instruments across distributed locations. Reliably and efficiently processing the immense volume of data generated by these instruments into research-grade data products requires addressing several challenges, including dynamic data and metadata, complete data provenance, and the incorporation of scientific algorithms into large-scale cyberinfrastructure. NEON is addressing these challenges by implementing a data processing system that is dynamically automated, fully version controlled, and highly modular. Dynamic automation is achieved via Pachyderm, which listens for data changes and executes the affected processing modules whenever one occurs. All changes to data, metadata, processing parameters, and processing code are version controlled in a Git-like fashion to achieve complete provenance. Open-source, Docker-based processing modules contain all code and system dependencies, providing reproducibility and transparency. Finally, the highly modular, Docker-based code deployment allows scientists and software developers to construct the various components of the data pipeline in the programming language of their choice, yet test, combine, and execute them in the same environment, thus efficiently integrating science code into a robust cyberinfrastructure. Here, we present the system design and share promising early results and lessons learned.
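The change-triggered, provenance-recording behavior described in the abstract can be illustrated with a small, self-contained sketch. This is not NEON's implementation and does not use the Pachyderm API; the `Pipeline` class, the `content_id` helper, and the `calibrate` module are hypothetical names introduced only for illustration. The sketch assumes that a processing step should rerun whenever its input data, parameters, or code version changes, and that each run should record hashes of all three.

```python
import hashlib
import json


def content_id(obj):
    """Return a short, stable hash of any JSON-serializable object."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


class Pipeline:
    """Hypothetical stand-in for one change-triggered processing step."""

    def __init__(self, module, module_version, params):
        self.module = module                  # the processing function
        self.module_version = module_version  # stands in for a code/image tag
        self.params = params                  # processing parameters
        self.last_state = None                # hash of the last processed state
        self.provenance = []                  # one record per executed run

    def on_commit(self, data):
        """Run the module only if data, parameters, or code version changed."""
        state = content_id(
            {"data": data, "params": self.params, "code": self.module_version}
        )
        if state == self.last_state:
            return None  # nothing changed, so skip reprocessing
        output = self.module(data, **self.params)
        self.last_state = state
        self.provenance.append(
            {
                "input": content_id(data),
                "params": content_id(self.params),
                "code": self.module_version,
                "output": content_id(output),
            }
        )
        return output


def calibrate(readings, gain, offset):
    """Illustrative science module: apply a linear calibration."""
    return [round(gain * r + offset, 6) for r in readings]


pipe = Pipeline(calibrate, "v1.0.0", {"gain": 2.0, "offset": 0.5})
first = pipe.on_commit([1.0, 2.0, 3.0])    # new data: the module runs
repeat = pipe.on_commit([1.0, 2.0, 3.0])   # identical state: skipped
pipe.params = {"gain": 2.0, "offset": 1.0}
rerun = pipe.on_commit([1.0, 2.0, 3.0])    # parameter change: the module reruns
```

In this sketch, the list in `pipe.provenance` plays the role of a commit history: every output can be traced to the exact data, parameters, and code version that produced it. A production system would persist these records and run each module in its own container.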
- Publication: AGU Fall Meeting Abstracts
- Pub Date: December 2019
- Bibcode: 2019AGUFMIN34A..03S
- Keywords:
  - 0394 Instruments and techniques; ATMOSPHERIC COMPOSITION AND STRUCTURE
  - 0520 Data analysis: algorithms and implementation; COMPUTATIONAL GEOPHYSICS
  - 0555 Neural networks, fuzzy logic, machine learning; COMPUTATIONAL GEOPHYSICS
  - 0594 Instruments and techniques; COMPUTATIONAL GEOPHYSICS