Every Row Counts: Combining Sketches and Sampling for Accurate Group-By Result Estimates
Abstract
Database systems heavily rely upon cardinality estimates for finding efficient execution plans, and estimation errors can easily affect query execution times by large factors. One particularly difficult problem is estimating the result size of a group-by operator, or, in general, the number of distinct combinations of a set of attributes. In contrast to, e. g., estimating the selectivity of simple filter predicates, the resulting number of groups cannot be predicted reliably without examining the complete input. As a consequence, most existing systems have poor estimates for the number of distinct groups. However, scanning entire relations at optimization time is not feasible in practice. Also, precise group counts cannot be precomputed for every possible combination of attributes. For practical purposes, a cheap mechanism is thus required which can handle arbitrary attribute combinations efficiently and with high accuracy. In this work, we present a novel estimation framework that combines sketched full information over individual columns with random sampling to correct for correlation bias between attributes. This combination can estimate group counts for individual columns nearly perfectly, and for arbitrary column combinations with high accuracy. Extensive experiments show that these excellent results hold for both synthetic and real-world data sets. We demonstrate how this mechanism can be integrated into existing systems with low overhead, and how estimation time can be kept negligible by means of an efficient algorithm for sample scans.