go back

Volume 15, No. 4

New Query Optimization Techniques in the Spark Engine of Azure Synapse

Authors:
Abhishek Modi (Microsoft) Kaushik Rajan (Microsoft Research)* Srinivas Thimmaiah (Microsoft) Prakhar Jain (Databricks) Swinky Mann (Microsoft) Ayushi Agarwal (Microsoft) Ajith Shetty (Microsoft) Shahid K I (Microsoft) Ashit Gosalia (Microsoft) Partho Sarthi (Microsoft Research)

Abstract

The cost of big-data query execution is dominated by stateful operators. These include sort and hash-aggregate that typically materialize intermediate data in memory, and exchange that materializes data to disk and transfers data over the network. In this paper we focus on several query optimization techniques that reduce the cost of these operators. First, we introduce a novel exchange placement algorithm that simultaneously minimizes the number of exchanges required and maximizes computation reuse. It combines information about partitioning requirements of operators, with the possibility of sub-computation reuse via a multi-consumer exchange, to determine the best placement of exchanges. Second, we incorporate several partial push-down optimizations that push down computation derived from existing operators (join, group-by, intersection) below these stateful operators. We ensure that such push-downs by themselves do not introduce additional data exchanges. Finally we propose peephole optimizations to specialize the implementation of stateful operators based on their input parameters. All our optimizations are implemented in the spark engine that powers Azure Synapse. We evaluate their impact on TPCDS and demonstrate that they make our engine 1.8x faster than Apache Spark 3.0.1.

PVLDB is part of the VLDB Endowment Inc.

Privacy Policy