Chukonu: A Fully-Featured High-Performance Big Data Framework that Integrates a Native Compute Engine into Spark

Authors:

Bowen Yu (Tsinghua University)*, Guanyu Feng (Tsinghua University), Huanqi Cao (Tsinghua University), Xiaohan Li (Tsinghua University), Zhenbo Sun (Tsinghua University), Haojie Wang (Tsinghua University), Xiaowei Zhu (Tsinghua University), Weimin Zheng (Tsinghua University), Wenguang Chen (Tsinghua University)

Abstract

Apache Spark is a widely deployed big data analytics framework that offers attractive features such as resiliency, load balancing, and a rich ecosystem. However, there is still plenty of room for improvement in its performance. Although a data-parallel system written in a native programming language can significantly improve performance, it may need to re-implement many of Spark's functionalities to become a full-featured system. It is therefore desirable for a native big data system to write only the compute engine in a native language, which ensures high efficiency, and to reuse the other mature features provided by Spark rather than re-implement everything. However, the interaction between the JVM and the native world risks becoming a bottleneck. This paper proposes Chukonu, a native big data framework that reuses critical big data features provided by Spark. Owing to our novel DAG-splitting approach, the potential Spark integration overhead is alleviated, and Chukonu even outperforms existing pure-native big data frameworks. Chukonu splits DAG programs into run-time parts and compile-time parts. The run-time parts are delegated to Spark, which offloads the complexity of implementing those features. The compile-time parts are natively compiled. We propose a series of optimization techniques to be applied to the compile-time parts, such as operator fusion, vectorization, and compaction, to significantly reduce the Spark integration overhead. Our evaluation shows that Chukonu achieves a speedup of up to 71.58× (geometric mean 6.09×) over Apache Spark, and up to 7.20× (geometric mean 2.30×) over pure-native frameworks on six commonly used big data applications. By translating the physical plan produced by SparkSQL into Chukonu programs, Chukonu accelerates SparkSQL’s TPC-DS performance by 2.29×.
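To make the idea of operator fusion mentioned in the abstract concrete, the following is a minimal, generic sketch in Python. It is not Chukonu's implementation or API; the function names `run_unfused` and `run_fused` and the sample data are hypothetical and chosen only for illustration. The sketch contrasts running a map and a filter as two separate operators, which materializes an intermediate collection, with a fused version that pushes each element through both steps in a single pass.

```python
from typing import Callable, Iterable, List


def run_unfused(
    data: Iterable[int],
    map_fn: Callable[[int], int],
    pred: Callable[[int], bool],
) -> List[int]:
    # Each operator materializes its full output before the next one starts,
    # so an intermediate list is built between the map and the filter.
    mapped = [map_fn(x) for x in data]
    return [y for y in mapped if pred(y)]


def run_fused(
    data: Iterable[int],
    map_fn: Callable[[int], int],
    pred: Callable[[int], bool],
) -> List[int]:
    # Fused version: one loop applies both operators to each element,
    # so no intermediate collection is created between them.
    out: List[int] = []
    for x in data:
        y = map_fn(x)
        if pred(y):
            out.append(y)
    return out


# Hypothetical sample input, used only to show that both versions agree.
records = [1, 2, 3, 4, 5]
unfused = run_unfused(records, lambda x: x * 10, lambda y: y > 20)
fused = run_fused(records, lambda x: x * 10, lambda y: y > 20)
```

Both versions compute the same result; the fused one simply avoids the intermediate buffer. In a system that crosses a JVM/native boundary, one would expect fusing adjacent operators to also reduce the number of boundary crossings, though how Chukonu does this in practice is described in the paper itself rather than here.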

PVLDB is part of the VLDB Endowment Inc.

Volume 15, No. 4
