Sampling-Based Estimation of the Number of Distinct Values of an Attribute.
Peter J. Haas, Jeffrey F. Naughton, S. Seshadri, Lynne Stokes
We provide several new sampling-based estimators of the number of distinct values of an attribute in a relation.
We compare these new estimators to estimators from the database and statistical literature empirically, using a large number of attribute-value distributions drawn from a variety of real-world databases.
This appears to be the first extensive comparison of distinct-value estimators in either the database or statistical literature, and is certainly the first to use highly skewed data of the sort frequently encountered in database applications.
Our experiments indicate that a new "hybrid" estimator yields the highest precision on average for a given sampling fraction.
This estimator explicitly takes into account the degree of skew in the data and combines a new "smoothed jackknife" estimator with an estimator due to Shlosser.
We investigate how the hybrid estimator behaves as we scale up the size of the database.
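As an illustration of what a sampling-based distinct-value estimator looks like, the following is a minimal sketch of the estimator due to Shlosser, one of the two components of the hybrid estimator mentioned above. The abstract does not state the formula; the form used here, D = d + f1 * sum_i (1-q)^i f_i / sum_i i q (1-q)^(i-1) f_i, is the one commonly given in the literature, where d is the number of distinct values in the sample, f_i is the number of values appearing exactly i times in the sample, and q is the sampling fraction. The function name and arguments are hypothetical, and the smoothed jackknife and the skew-based choice made by the hybrid estimator are not shown.

```python
from collections import Counter


def shlosser_estimate(sample, q):
    """Sketch of Shlosser's distinct-value estimator (assumed standard form).

    sample: iterable of attribute values drawn from the relation
    q:      sampling fraction (sample size / relation size), 0 < q <= 1
    """
    if not 0 < q <= 1:
        raise ValueError("sampling fraction q must be in (0, 1]")

    # How many times each value occurs in the sample.
    value_counts = Counter(sample)
    d = len(value_counts)  # distinct values seen in the sample

    # f[i] = number of values that occur exactly i times in the sample.
    f = Counter(value_counts.values())
    f1 = f.get(1, 0)

    numerator = sum((1 - q) ** i * f_i for i, f_i in f.items())
    denominator = sum(i * q * (1 - q) ** (i - 1) * f_i for i, f_i in f.items())

    # With no usable information (e.g. an empty sample), fall back to d.
    if denominator == 0:
        return float(d)
    return d + f1 * numerator / denominator
```

Calling `shlosser_estimate(values, 0.1)` on a 10% sample returns an estimate that is at least the number of distinct values observed in the sample; when q is 1 the correction term vanishes and the estimate equals the observed count.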
Printed Edition
Umeshwar Dayal, Peter M. D. Gray, Shojiro Nishio (Eds.):
VLDB'95, Proceedings of 21th International Conference on Very Large Data Bases, September 11-15, 1995, Zurich, Switzerland.
Morgan Kaufmann 1995, ISBN 1-55860-379-4
