ACM SIGMOD Anthology VLDB dblp.uni-trier.de

Optimal Grid-Clustering: Towards Breaking the Curse of Dimensionality in High-Dimensional Clustering.

Alexander Hinneburg, Daniel A. Keim: Optimal Grid-Clustering: Towards Breaking the Curse of Dimensionality in High-Dimensional Clustering. VLDB 1999: 506-517
@inproceedings{DBLP:conf/vldb/KeimH99,
  author    = {Alexander Hinneburg and
               Daniel A. Keim},
  editor    = {Malcolm P. Atkinson and
               Maria E. Orlowska and
               Patrick Valduriez and
               Stanley B. Zdonik and
               Michael L. Brodie},
  title     = {Optimal Grid-Clustering: Towards Breaking the Curse of Dimensionality
               in High-Dimensional Clustering},
  booktitle = {VLDB'99, Proceedings of 25th International Conference on Very
               Large Data Bases, September 7-10, 1999, Edinburgh, Scotland,
               UK},
  publisher = {Morgan Kaufmann},
  year      = {1999},
  isbn      = {1-55860-615-7},
  pages     = {506-517},
  ee        = {db/conf/vldb/KeimH99.html},
  crossref  = {DBLP:conf/vldb/99},
  bibsource = {DBLP, http://dblp.uni-trier.de}
}

Abstract

Many applications require the clustering of large amounts of high-dimensional data. Most clustering algorithms, however, do not work effectively and efficiently in high-dimensional space, which is due to the so-called ``curse of dimensionality''. In addition, the high-dimensional data often contains a significant amount of noise which causes additional effectiveness problems. In this paper, we review and compare the existing algorithms for clustering high-dimensional data and show the impact of the curse of dimensionality on their effectiveness and efficiency. The comparison reveals that condensation-based approaches (such as BIRCH or STING) are the most promising candidates for achieving the necessary efficiency, but it also shows that basically all condension-based approaches have severe weaknesses with respect to their effectiveness in high-dimensional space. To overcome these problems, we develop a new clustering technique called OptiGrid which is based on constructing an optimal grid-partitioning of the data. The optimal grid-partitioning is determined by calculating the best partitioning hyperplanes for each dimension (if such a partitioning exists) using certain projections of the data. The advantages of our new approach are (1) it has a firm mathematical basis (2) it is by far more effective than existing clustering algorithms for high-dimensional data (3) it is very efficient even for large data sets of high dimensionality. To demonstrate the effectiveness and efficiency of our new approach, we perform a series of experiments on a number of different data sets from CAD and molecular biology. A comparison with one of the best known algorithms (BIRCH) shows the superiority of our new approach.

Copyright © 1999 by the VLDB Endowment. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by the permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.


Online Paper

DVD Version: Load ACM SIGMOD Anthology DVD 1" and ...

Printed Edition

Malcolm P. Atkinson, Maria E. Orlowska, Patrick Valduriez, Stanley B. Zdonik, Michael L. Brodie (Eds.): VLDB'99, Proceedings of 25th International Conference on Very Large Data Bases, September 7-10, 1999, Edinburgh, Scotland, UK. Morgan Kaufmann 1999, ISBN 1-55860-615-7
Contents CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML

References

[AGG+98]
Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos, Prabhakar Raghavan: Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications. SIGMOD Conference 1998: 94-105 CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[BGRS99]
Kevin S. Beyer, Jonathan Goldstein, Raghu Ramakrishnan, Uri Shaft: When Is ''Nearest Neighbor'' Meaningful? ICDT 1999: 217-235 CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[DJSGM97]
...
[EKSX96]
Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu: A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. KDD 1996: 226-231 CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[DJSGM97]
Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu: Density-Connected Sets and their Application for Trend Detection in Spatial Databases. KDD 1997: 10-15 CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[EKX95]
Martin Ester, Hans-Peter Kriegel, Xiaowei Xu: Knowledge Discovery in Large Spatial Databases: Focusing Techniques for Efficient Class Identification. SSD 1995: 67-82 CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[FL95]
Christos Faloutsos, King-Ip Lin: FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets. SIGMOD Conference 1995: 163-174 CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[FH75]
...
[HK98]
Alexander Hinneburg, Daniel A. Keim: An Efficient Approach to Clustering in Large Multimedia Databases with Noise. KDD 1998: 58-65 CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[HK99]
...
[Hub85]
...
[Jag91]
H. V. Jagadish: A Retrieval Technique for Similar Shapes. SIGMOD Conference 1991: 208-217 CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[Kuk92]
Karen Kukich: Techniques for Automatically Correcting Words in Text. ACM Comput. Surv. 24(4): 377-439(1992) CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[MG95]
Rajiv Mehrotra, James E. Gary: Feature-Index-Based Similar Shape Retrieval. VDB 1995: 46-65 CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[MN89]
Kurt Mehlhorn, Stefan Näher: LEDA: A Library of Efficient Data Types and Algorithms. MFCS 1989: 88-106 CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[NH94]
Raymond T. Ng, Jiawei Han: Efficient and Effective Clustering Methods for Spatial Data Mining. VLDB 1994: 144-155 CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[Sch96]
...
[Sco92]
...
[Sil86]
...
[SCZ98]
Gholamhosein Sheikholeslami, Surojit Chatterjee, Aidong Zhang: WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases. VLDB 1998: 428-439 CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[SH94]
...
[WW80]
...
[WYM97]
Wei Wang, Jiong Yang, Richard R. Muntz: STING: A Statistical Information Grid Approach to Spatial Data Mining. VLDB 1997: 186-195 CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[XEKS98]
Xiaowei Xu, Martin Ester, Hans-Peter Kriegel, Jörg Sander: A Distribution-Based Clustering Algorithm for Mining in Large Spatial Databases. ICDE 1998: 324-331 CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML
[ZRL96]
Tian Zhang, Raghu Ramakrishnan, Miron Livny: BIRCH: An Efficient Data Clustering Method for Very Large Databases. SIGMOD Conference 1996: 103-114 CiteSeerX Google scholar pubzone.org BibTeX bibliographical record in XML

Copyright © Tue Mar 16 02:22:08 2010 by Michael Ley (ley@uni-trier.de)