GraphX: Unifying Data-Parallel and Graph-Parallel Analytics
Abstract
From social networks to language modeling, the growing scale and importance of graph data have driven the development of numerous new graph-parallel systems (e.g., Pregel, GraphLab). By restricting the computation that can be expressed and introducing new techniques to partition and distribute the graph, these systems can efficiently execute iterative graph algorithms orders of magnitude faster than more general data-parallel systems. However, the same restrictions that enable the performance gains also make it difficult to express many of the important stages in a typical graph-analytics pipeline: constructing the graph, modifying its structure, or expressing computation that spans multiple graphs. As a consequence, existing graph analytics pipelines compose graph-parallel and data-parallel systems using external storage systems, leading to extensive data movement and a complicated programming model. To address these challenges we introduce GraphX, a distributed graph computation framework that unifies graph-parallel and data-parallel computation. GraphX provides a small, core set of graph-parallel operators expressive enough to implement the Pregel and PowerGraph abstractions, yet simple enough to be cast in relational algebra. GraphX uses a collection of query optimization techniques, such as automatic join rewrites, to efficiently implement these graph-parallel operators. We evaluate GraphX on real-world graphs and workloads and demonstrate that GraphX achieves performance comparable to specialized graph computation systems, while outperforming them in end-to-end graph pipelines. Moreover, GraphX achieves a balance between expressiveness, performance, and ease of use.
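To make the claim that graph-parallel operators can be "cast in relational algebra" more concrete, the following is a minimal sketch, not GraphX's actual API or implementation. It assumes a toy vertex table and edge table (both hypothetical) and expresses one PageRank-style update as a join of edges with vertices, a group-by on the destination, and a join back to the vertex table. The function names, the sample graph, and the damping value of 0.85 are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical toy tables (not from the paper):
# vertex table maps vertex id -> current rank; edge table holds (src, dst) rows.
vertices = {1: 1.0, 2: 1.0, 3: 1.0}
edges = [(1, 2), (1, 3), (2, 3), (3, 1)]


def out_degrees(edge_rows):
    """Group the edge table by source and count rows (out-degree per vertex)."""
    deg = defaultdict(int)
    for src, _dst in edge_rows:
        deg[src] += 1
    return dict(deg)


def pagerank_step(vertex_table, edge_rows, damping=0.85):
    """One rank update written as join -> group-by -> join."""
    deg = out_degrees(edge_rows)

    # Join edges with vertices on edge.src = vertex.id:
    # each edge row yields one message for its destination.
    messages = [(dst, vertex_table[src] / deg[src]) for src, dst in edge_rows]

    # Group by destination and sum the incoming messages.
    sums = defaultdict(float)
    for dst, value in messages:
        sums[dst] += value

    # Left outer join back to the vertex table; vertices with no
    # incoming messages get a sum of 0.0.
    return {
        vid: (1.0 - damping) + damping * sums.get(vid, 0.0)
        for vid in vertex_table
    }


ranks = pagerank_step(vertices, edges)
```

In this formulation the "send messages along edges, then aggregate at each vertex" pattern of a graph-parallel system reduces to ordinary joins and aggregations, which is the kind of structure a query optimizer could in principle rewrite.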
 Publication:

arXiv e-prints
 Pub Date:
 February 2014
 arXiv:
 arXiv:1402.2394
 Bibcode:
 2014arXiv1402.2394X
 Keywords:

 Computer Science - Databases