Scalable Data Mining and Archiving for the Square Kilometre Array
Abstract
As the technologies for remote observation improve, the rapid increase in the frequency and fidelity of those observations translates into an avalanche of data that is already beginning to eclipse the resources, both human and technical, of the institutions and facilities charged with managing the information. Common data management tasks, such as cataloging both the data itself and its contextual metadata, creating and maintaining a scalable permanent archive, and making data available on demand for research, present significant software engineering challenges at the scale of modern multinational scientific enterprises such as the upcoming Square Kilometre Array project. The NASA Jet Propulsion Laboratory (JPL), leveraging internal research and technology development funding, has begun to explore ways to address these data archiving and distribution challenges through a number of parallel activities involving collaborations with the EVLA and ALMA teams at the National Radio Astronomy Observatory (NRAO) and with members of the Square Kilometre Array South Africa team. To date, we have leveraged the Apache OODT Process Control System framework and its catalog and archive service components, which provide file management, workflow management, and resource management as core web services. A client crawler framework ingests upstream data (e.g., EVLA raw directory output), identifies its MIME type, and automatically extracts relevant metadata, including temporal bounds and job-relevant/processing information. A remote content acquisition (pushpull) service is responsible for staging remote content and handing it off to the crawler framework. A science algorithm wrapper (called CAS-PGE) wraps underlying code, including CASApy programs for the EVLA such as Continuum Imaging and Spectral Line Cube generation, executes the algorithm, and ingests its output (along with relevant extracted metadata).
In addition to processing, the Process Control System has been leveraged to provide data curation and automatic ingestion for the MeerKAT/KAT-7 precursor instrument in South Africa, helping to catalog and archive correlator and sensor output from KAT-7 and to make the information available for downstream science analysis. These efforts, supported by the increasing availability of high-quality open source software, represent a concerted attempt to find a cost-conscious methodology for maintaining the integrity of observational data from the upstream instrument to the archive, while ensuring that the data, with its richly annotated catalog of metadata, remains a viable resource for research into the future.
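The crawler step described in the abstract (walk staged data, identify each file's MIME type, extract metadata for the catalog) can be illustrated with a minimal sketch. This is not the Apache OODT API: the function names `extract_metadata` and `crawl` and the record layout are hypothetical, and only the Python standard library is used. The file modification time stands in for the temporal bounds that a real extractor would read from the observation data itself.

```python
import mimetypes
import os
import tempfile
from datetime import datetime, timezone


def extract_metadata(path):
    """Build one catalog record for a file (hypothetical record layout)."""
    # Guess the MIME type from the file name; fall back to a generic binary type.
    mime_type, _encoding = mimetypes.guess_type(path)
    info = os.stat(path)
    return {
        "path": path,
        "mime_type": mime_type or "application/octet-stream",
        "size_bytes": info.st_size,
        # Assumption: modification time is only a stand-in for real temporal bounds.
        "modified_utc": datetime.fromtimestamp(
            info.st_mtime, tz=timezone.utc
        ).isoformat(),
    }


def crawl(root):
    """Walk a staging directory and return one record per file found."""
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            records.append(extract_metadata(os.path.join(dirpath, name)))
    return records


if __name__ == "__main__":
    # Stage a single small file in a temporary directory and crawl it.
    with tempfile.TemporaryDirectory() as staging:
        with open(os.path.join(staging, "notes.txt"), "w") as handle:
            handle.write("hello")
        for record in crawl(staging):
            print(record["mime_type"], record["size_bytes"])
```

In a production system the records would be handed to a file management service for ingestion rather than printed; the sketch only shows the detect-then-extract ordering.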
- Publication: AGU Fall Meeting Abstracts
- Pub Date: December 2011
- Bibcode: 2011AGUFMIN23B1456J
- Keywords:
  - 1914 INFORMATICS / Data mining
  - 7594 SOLAR PHYSICS, ASTROPHYSICS, AND ASTRONOMY / Instruments and techniques