Constructing Flexible, Configurable ETL Pipelines for the Analysis of "Big Data" with Apache OODT
Abstract
A plethora of open source technologies for manipulating, transforming, querying, and visualizing "big data" have blossomed and matured in the last few years, driven in large part by recognition of the tremendous value that can be derived by applying data mining and visualization techniques to large data sets. One facet of many of these tools is that input data must often be prepared into a particular format (e.g., JSON, CSV) or loaded into a particular storage technology (e.g., HDFS) before analysis can take place. This process, commonly known as Extract-Transform-Load, or ETL, often involves multiple well-defined steps that must be executed in a particular order, and the approach taken for a particular data set is generally sensitive to the quantity and quality of the input data, as well as to the structure and complexity of the desired output. When working with very large, heterogeneous, unstructured, or semi-structured data sets, automating the ETL process and monitoring its progress become increasingly important. Apache Object Oriented Data Technology (OODT) provides a suite of complementary data management components, called the Process Control System (PCS), that can be connected together to form flexible ETL pipelines, along with browser-based user interfaces for monitoring and controlling ongoing operations. The lightweight, metadata-driven middleware layer can be wrapped around custom ETL workflow steps, which themselves can be implemented in any language. Once configured, it facilitates communication between workflow steps and supports execution of ETL pipelines across a distributed cluster of compute resources. As participants in a DARPA-funded effort to develop open source tools for large-scale data analysis, we used Apache OODT to rapidly construct custom ETL pipelines for a variety of very large data sets to prepare them for analysis and visualization applications. We feel that OODT, free and open source software available through the Apache Software Foundation, is particularly well suited to developing and managing arbitrary large-scale ETL processes, both for the simplicity and flexibility of its wrapper framework and for the detailed provenance information it exposes throughout the process. Our experience using OODT to manage the processing of large-scale data sets in domains as diverse as radio astronomy, life sciences, and social network analysis demonstrates the flexibility of the framework and its applicability to a broad array of big data ETL challenges.
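To make the wrapper model concrete, the sketch below shows what one stage of such a pipeline might look like as a CAS Workflow Manager task. It is a minimal illustration, assuming OODT's Java WorkflowTaskInstance interface (whose exact signature varies slightly across OODT releases); the class name CsvTransformTask, the metadata keys InputFile and OutputFile, and the OutputDir task property are hypothetical, not part of OODT itself.

```java
import org.apache.oodt.cas.metadata.Metadata;
import org.apache.oodt.cas.workflow.structs.WorkflowTaskConfiguration;
import org.apache.oodt.cas.workflow.structs.WorkflowTaskInstance;
import org.apache.oodt.cas.workflow.structs.exceptions.WorkflowTaskInstanceException;

// Hypothetical "Transform" stage of an ETL pipeline: reads the staged input
// file named by upstream metadata, converts it to CSV, and records the output
// location in the shared metadata so a downstream "Load" task can find it.
public class CsvTransformTask implements WorkflowTaskInstance {

  @Override
  public void run(Metadata metadata, WorkflowTaskConfiguration config)
      throws WorkflowTaskInstanceException {
    // Dynamic metadata produced by the previous (extract) step.
    String inputPath = metadata.getMetadata("InputFile");
    // Static, per-task configuration from the workflow's task definition.
    String outputDir = config.getProperty("OutputDir");
    String outputPath = outputDir + "/converted.csv";

    try {
      convertToCsv(inputPath, outputPath); // domain-specific transform
    } catch (Exception e) {
      throw new WorkflowTaskInstanceException(
          "CSV conversion failed for " + inputPath + ": " + e.getMessage());
    }

    // Publish provenance so downstream steps and monitoring interfaces
    // can see where this step wrote its output.
    metadata.replaceMetadata("OutputFile", outputPath);
  }

  private void convertToCsv(String in, String out) throws Exception {
    // Placeholder for the actual, data-set-specific transformation logic.
  }
}
```

In a real deployment the task would be registered in the Workflow Manager's XML task configuration, and steps written in other languages can be wrapped analogously by a task that invokes an external script.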
- Publication: AGU Fall Meeting Abstracts
- Pub Date: December 2013
- Bibcode: 2013AGUFMIN43C..07H
- Keywords:
  - 1910 INFORMATICS Data assimilation, integration and fusion
  - 1978 INFORMATICS Software re-use
  - 1998 INFORMATICS Workflow
  - 1976 INFORMATICS Software tools and services