Optimal Grid-Clustering: Towards Breaking the Curse of Dimensionality in High-Dimensional Clustering.

Alexander Hinneburg, Daniel A. Keim: Optimal Grid-Clustering: Towards Breaking the Curse of Dimensionality in High-Dimensional Clustering. VLDB 1999: 506-517

@inproceedings{DBLP:conf/vldb/KeimH99,
  author    = {Alexander Hinneburg and
               Daniel A. Keim},
  editor    = {Malcolm P. Atkinson and
               Maria E. Orlowska and
               Patrick Valduriez and
               Stanley B. Zdonik and
               Michael L. Brodie},
  title     = {Optimal Grid-Clustering: Towards Breaking the Curse of Dimensionality
               in High-Dimensional Clustering},
  booktitle = {VLDB'99, Proceedings of 25th International Conference on Very
               Large Data Bases, September 7-10, 1999, Edinburgh, Scotland,
               UK},
  publisher = {Morgan Kaufmann},
  year      = {1999},
  isbn      = {1-55860-615-7},
  pages     = {506-517},
  ee        = {db/conf/vldb/KeimH99.html},
  crossref  = {DBLP:conf/vldb/99},
  bibsource = {DBLP, http://dblp.uni-trier.de}
}

Abstract

Many applications require the clustering of large amounts of high-dimensional data. Most clustering algorithms, however, do not work effectively and efficiently in high-dimensional space, which is due to the so-called ``curse of dimensionality''. In addition, the high-dimensional data often contains a significant amount of noise which causes additional effectiveness problems. In this paper, we review and compare the existing algorithms for clustering high-dimensional data and show the impact of the curse of dimensionality on their effectiveness and efficiency. The comparison reveals that condensation-based approaches (such as BIRCH or STING) are the most promising candidates for achieving the necessary efficiency, but it also shows that basically all condension-based approaches have severe weaknesses with respect to their effectiveness in high-dimensional space. To overcome these problems, we develop a new clustering technique called OptiGrid which is based on constructing an optimal grid-partitioning of the data. The optimal grid-partitioning is determined by calculating the best partitioning hyperplanes for each dimension (if such a partitioning exists) using certain projections of the data. The advantages of our new approach are (1) it has a firm mathematical basis (2) it is by far more effective than existing clustering algorithms for high-dimensional data (3) it is very efficient even for large data sets of high dimensionality. To demonstrate the effectiveness and efficiency of our new approach, we perform a series of experiments on a number of different data sets from CAD and molecular biology. A comparison with one of the best known algorithms (BIRCH) shows the superiority of our new approach.

Copyright © 1999 by the VLDB Endowment. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by the permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.

Online Paper

Download PDF file (www.vldb.org, Darmstadt, Germany)
Download PDF file (www.acm.org, New York, USA)

DVD Version: Load ACM SIGMOD Anthology DVD 1" and ...

Windows: Click the letter of your CD drive
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
Mac: Click here
UNIX/LINUX: mount the DVD and click on the path of your mount point:
/Anthology/aDVD1 or /dvd

Printed Edition

Malcolm P. Atkinson, Maria E. Orlowska, Patrick Valduriez, Stanley B. Zdonik, Michael L. Brodie (Eds.): VLDB'99, Proceedings of 25th International Conference on Very Large Data Bases, September 7-10, 1999, Edinburgh, Scotland, UK. Morgan Kaufmann 1999, ISBN 1-55860-615-7
Contents

References

[AGG+98]: Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos, Prabhakar Raghavan: Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications. SIGMOD Conference 1998: 94-105
[BGRS99]: Kevin S. Beyer, Jonathan Goldstein, Raghu Ramakrishnan, Uri Shaft: When Is ''Nearest Neighbor'' Meaningful? ICDT 1999: 217-235
[DJSGM97]: ...
[EKSX96]: Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu: A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. KDD 1996: 226-231
[DJSGM97]: Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu: Density-Connected Sets and their Application for Trend Detection in Spatial Databases. KDD 1997: 10-15
[EKX95]: Martin Ester, Hans-Peter Kriegel, Xiaowei Xu: Knowledge Discovery in Large Spatial Databases: Focusing Techniques for Efficient Class Identification. SSD 1995: 67-82
[FL95]: Christos Faloutsos, King-Ip Lin: FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets. SIGMOD Conference 1995: 163-174
[FH75]: ...
[HK98]: Alexander Hinneburg, Daniel A. Keim: An Efficient Approach to Clustering in Large Multimedia Databases with Noise. KDD 1998: 58-65
[HK99]: ...
[Hub85]: ...
[Jag91]: H. V. Jagadish: A Retrieval Technique for Similar Shapes. SIGMOD Conference 1991: 208-217
[Kuk92]: Karen Kukich: Techniques for Automatically Correcting Words in Text. ACM Comput. Surv. 24(4): 377-439(1992)
[MG95]: Rajiv Mehrotra, James E. Gary: Feature-Index-Based Similar Shape Retrieval. VDB 1995: 46-65
[MN89]: Kurt Mehlhorn, Stefan Näher: LEDA: A Library of Efficient Data Types and Algorithms. MFCS 1989: 88-106
[NH94]: Raymond T. Ng, Jiawei Han: Efficient and Effective Clustering Methods for Spatial Data Mining. VLDB 1994: 144-155
[Sch96]: ...
[Sco92]: ...
[Sil86]: ...
[SCZ98]: Gholamhosein Sheikholeslami, Surojit Chatterjee, Aidong Zhang: WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases. VLDB 1998: 428-439
[SH94]: ...
[WW80]: ...
[WYM97]: Wei Wang, Jiong Yang, Richard R. Muntz: STING: A Statistical Information Grid Approach to Spatial Data Mining. VLDB 1997: 186-195
[XEKS98]: Xiaowei Xu, Martin Ester, Hans-Peter Kriegel, Jörg Sander: A Distribution-Based Clustering Algorithm for Mining in Large Spatial Databases. ICDE 1998: 324-331
[ZRL96]: Tian Zhang, Raghu Ramakrishnan, Miron Livny: BIRCH: An Efficient Data Clustering Method for Very Large Databases. SIGMOD Conference 1996: 103-114