alpha to omega: the G(r)eek Alphabet of Sampling
Abstract
Sampling is the most versatile and easiest to implement cardinality estimation method. Therefore, it is implemented in almost every database management system, commercial or not. Consequently, the main purpose of this paper is to provide the reader with an intuition about sampling precision. In the context of query optimization, the basic procedure can be described as follows. From a relation R containing n tuples, a sample of m < n tuples is drawn. Then, a query predicate p is evaluated on the m sample tuples, and the number k of qualifying sample tuples is recorded. Assume the evaluation of the same predicate p on the relation R results in l qualifying tuples. The task now is to produce an estimate lˆ for l where n, m, k are given. The standard answer to this task is lˆ = k n m . However, there are some (yet unanswered) fundamental questions: 1. Is the standard estimator the best way to derive an estimate? 2. What are the upper and lower bounds for l? 3. How can we derive an estimate that minimizes the qerror? 4. How large is the q-error we can expect for this estimate? 5. For a given maximal allowed q-error, which sample size m should we choose? Since sampling is a probabilistic process, we will give probabilistic answers to these questions. Further, we show how result cardinality estimates for selections and joins can significantly be improved.