Optimal Grid-Clustering: Towards Breaking the Curse of Dimensionality in High-Dimensional Clustering.
Alexander Hinneburg, Daniel A. Keim:
Optimal Grid-Clustering: Towards Breaking the Curse of Dimensionality in High-Dimensional Clustering.
VLDB 1999: 506-517@inproceedings{DBLP:conf/vldb/KeimH99,
author = {Alexander Hinneburg and
Daniel A. Keim},
editor = {Malcolm P. Atkinson and
Maria E. Orlowska and
Patrick Valduriez and
Stanley B. Zdonik and
Michael L. Brodie},
title = {Optimal Grid-Clustering: Towards Breaking the Curse of Dimensionality
in High-Dimensional Clustering},
booktitle = {VLDB'99, Proceedings of 25th International Conference on Very
Large Data Bases, September 7-10, 1999, Edinburgh, Scotland,
UK},
publisher = {Morgan Kaufmann},
year = {1999},
isbn = {1-55860-615-7},
pages = {506-517},
ee = {db/conf/vldb/KeimH99.html},
crossref = {DBLP:conf/vldb/99},
bibsource = {DBLP, http://dblp.uni-trier.de}
}
Abstract
Many applications require the clustering of large amounts
of high-dimensional data. Most clustering algorithms,
however, do not work effectively and efficiently in high-dimensional
space, which is due to the so-called ``curse of dimensionality''.
In addition, the high-dimensional data
often contains a significant amount of noise which causes
additional effectiveness problems. In this paper, we review
and compare the existing algorithms for clustering high-dimensional
data and show the impact of the curse of dimensionality on
their effectiveness and efficiency. The comparison
reveals that condensation-based approaches (such
as BIRCH or STING) are the most promising candidates
for achieving the necessary efficiency, but it also shows
that basically all condension-based approaches have severe
weaknesses with respect to their effectiveness in high-dimensional
space. To overcome these problems, we develop a new clustering
technique called OptiGrid which is based on
constructing an optimal grid-partitioning of the data. The
optimal grid-partitioning is determined by calculating the
best partitioning hyperplanes for each dimension (if such a
partitioning exists) using certain projections of the data.
The advantages of our new approach are
(1) it has a firm mathematical basis
(2) it is by far more effective than existing clustering
algorithms for high-dimensional data
(3) it is very efficient even for large data sets of high
dimensionality.
To demonstrate the effectiveness and efficiency of our new
approach, we perform a series of experiments on a number of
different data sets from CAD and molecular biology.
A comparison with one of the best known algorithms (BIRCH)
shows the superiority of our new approach.
Copyright © 1999 by the VLDB Endowment.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or
distributed for direct commercial advantage, the VLDB
copyright notice and the title of the publication and
its date appear, and notice is given that copying
is by the permission of the Very Large Data Base
Endowment. To copy otherwise, or to republish, requires
a fee and/or special permission from the Endowment.
Online Paper
DVD Version: Load ACM SIGMOD Anthology DVD 1" and ...
Printed Edition
Malcolm P. Atkinson, Maria E. Orlowska, Patrick Valduriez, Stanley B. Zdonik, Michael L. Brodie (Eds.):
VLDB'99, Proceedings of 25th International Conference on Very Large Data Bases, September 7-10, 1999, Edinburgh, Scotland, UK.
Morgan Kaufmann 1999, ISBN 1-55860-615-7
Contents
References
- [AGG+98]
- Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos, Prabhakar Raghavan:
Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications.
SIGMOD Conference 1998: 94-105
- [BGRS99]
- Kevin S. Beyer, Jonathan Goldstein, Raghu Ramakrishnan, Uri Shaft:
When Is ''Nearest Neighbor'' Meaningful?
ICDT 1999: 217-235
- [DJSGM97]
- ...
- [EKSX96]
- Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu:
A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise.
KDD 1996: 226-231
- [DJSGM97]
- Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu:
Density-Connected Sets and their Application for Trend Detection in Spatial Databases.
KDD 1997: 10-15
- [EKX95]
- Martin Ester, Hans-Peter Kriegel, Xiaowei Xu:
Knowledge Discovery in Large Spatial Databases: Focusing Techniques for Efficient Class Identification.
SSD 1995: 67-82
- [FL95]
- Christos Faloutsos, King-Ip Lin:
FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets.
SIGMOD Conference 1995: 163-174
- [FH75]
- ...
- [HK98]
- Alexander Hinneburg, Daniel A. Keim:
An Efficient Approach to Clustering in Large Multimedia Databases with Noise.
KDD 1998: 58-65
- [HK99]
- ...
- [Hub85]
- ...
- [Jag91]
- H. V. Jagadish:
A Retrieval Technique for Similar Shapes.
SIGMOD Conference 1991: 208-217
- [Kuk92]
- Karen Kukich:
Techniques for Automatically Correcting Words in Text.
ACM Comput. Surv. 24(4): 377-439(1992)
- [MG95]
- Rajiv Mehrotra, James E. Gary:
Feature-Index-Based Similar Shape Retrieval.
VDB 1995: 46-65
- [MN89]
- Kurt Mehlhorn, Stefan Näher:
LEDA: A Library of Efficient Data Types and Algorithms.
MFCS 1989: 88-106
- [NH94]
- Raymond T. Ng, Jiawei Han:
Efficient and Effective Clustering Methods for Spatial Data Mining.
VLDB 1994: 144-155
- [Sch96]
- ...
- [Sco92]
- ...
- [Sil86]
- ...
- [SCZ98]
- Gholamhosein Sheikholeslami, Surojit Chatterjee, Aidong Zhang:
WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases.
VLDB 1998: 428-439
- [SH94]
- ...
- [WW80]
- ...
- [WYM97]
- Wei Wang, Jiong Yang, Richard R. Muntz:
STING: A Statistical Information Grid Approach to Spatial Data Mining.
VLDB 1997: 186-195
- [XEKS98]
- Xiaowei Xu, Martin Ester, Hans-Peter Kriegel, Jörg Sander:
A Distribution-Based Clustering Algorithm for Mining in Large Spatial Databases.
ICDE 1998: 324-331
- [ZRL96]
- Tian Zhang, Raghu Ramakrishnan, Miron Livny:
BIRCH: An Efficient Data Clustering Method for Very Large Databases.
SIGMOD Conference 1996: 103-114
Copyright © Tue Mar 16 02:22:08 2010
by Michael Ley (ley@uni-trier.de)