We describe how we achieve super-linear speedup over standard approaches for oceanographic analytics on a cluster computer and on the Amazon Web Services (AWS) cloud. NEXUS is an open-source platform for big data analytics in the cloud. It achieves this performance by combining horizontally scalable data parallelism in Apache Spark with rapid data search, subsetting, and retrieval from tiled array storage in cloud-aware NoSQL databases such as Solr and Cassandra. NEXUS is the engine behind several public portals at NASA. OceanWorks, a newly funded project for the ocean community, will mature and extend this capability for improved data discovery, subsetting, quality screening, analysis, matchup of satellite and in situ measurements, and visualization. We review the Python API for Spark, show how to use it to quickly convert existing programs to run with cloud-scale parallelism, and discuss strategies for improving performance. We explain how partitioning the data over space, time, or both leads to algorithmic design patterns for Spark analytics that apply to many different algorithms. We use NEXUS analytics as examples, including the area-averaged time series, time-averaged map, and correlation map.
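As a rough illustration of the partition-then-combine pattern described in the abstract, the sketch below computes an area-averaged time series from tiles in plain Python. It is not the NEXUS implementation: the tile layout, the sample values, and the function names `tile_to_partial`, `merge_partials`, and `area_averaged_time_series` are all assumptions made for this example, and the average is unweighted (no latitude weighting). The map and reduce functions are written so that, under those assumptions, they could be handed to PySpark's `RDD.map` and `RDD.reduceByKey` in place of the local loop.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical tiles: (time_index, values in one spatial tile).
# None marks a missing cell (for example land or a cloud-masked pixel).
tiles = [
    (0, [1.0, 2.0, None]),
    (0, [3.0]),
    (1, [4.0, 6.0]),
    (1, [None, 8.0]),
]

def tile_to_partial(tile):
    """Map step: reduce one tile to (time, (sum, count)) over valid cells."""
    t, values = tile
    valid = [v for v in values if v is not None]
    return t, (sum(valid), len(valid))

def merge_partials(a, b):
    """Reduce step: merge two (sum, count) pairs; associative and commutative."""
    return a[0] + b[0], a[1] + b[1]

def area_averaged_time_series(tiles):
    # Local stand-in for a distributed map followed by a reduce-by-key.
    # A PySpark version (assumed, not taken from NEXUS) would look roughly like:
    #   sc.parallelize(tiles).map(tile_to_partial).reduceByKey(merge_partials)
    grouped = defaultdict(list)
    for t, partial in map(tile_to_partial, tiles):
        grouped[t].append(partial)
    totals = {t: reduce(merge_partials, parts) for t, parts in grouped.items()}
    # Unweighted mean per time step; time steps with no valid cells are skipped.
    return {t: s / n for t, (s, n) in sorted(totals.items()) if n > 0}

series = area_averaged_time_series(tiles)
print(series)
```

Carrying a (sum, count) pair rather than a per-tile mean is what lets the partial results be merged in any order, which is the property a distributed reduce relies on. The same shape applies to a time-averaged map if the key is the spatial cell instead of the time index.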
AGU Fall Meeting Abstracts
- Pub Date: December 2017
- 1916 Data and information discovery;
- 1920 Emerging informatics technologies;
- 1932 High-performance computing;
- 1998 Workflow;