Apache Spark Streaming, Kafka and HarmonicIO: A Performance Benchmark and Architecture Comparison for Enterprise and Scientific Computing
This paper presents a benchmark of stream processing throughput comparing Apache Spark Streaming (under file-, TCP socket- and Kafka-based stream integration), with a prototype P2P stream processing framework, HarmonicIO. Maximum throughput for a spectrum of stream processing loads are measured, specifically, those with large message sizes (up to 10MB), and heavy CPU loads -- more typical of scientific computing use cases (such as microscopy), than enterprise contexts. A detailed exploration of the performance characteristics with these streaming sources, under varying loads, reveals an interplay of performance trade-offs, uncovering the boundaries of good performance for each framework and streaming source integration. We compare with theoretic bounds in each case. Based on these results, we suggest which frameworks and streaming sources are likely to offer good performance for a given load. Broadly, the advantages of Spark's rich feature set comes at a cost of sensitivity to message size in particular -- common stream source integrations can perform poorly in the 1MB-10MB range. The simplicity of HarmonicIO offers more robust performance in this region, especially for raw CPU utilization.