go back
go back
Volume 15, No. 6
PACk: An Efficient Partition-based Distributed Agglomerative Hierarchical Clustering Algorithm for Deduplication
Authors:
Yue Wang (Microsoft)* Vivek Narasayya (Microsoft) Yeye He (Microsoft Research) Surajit Chaudhuri (Microsoft)
Abstract
The Agglomerative Hierarchical Clustering (AHC) algorithm is widely used in numerous real-world applications. As data volumes continue to grow, efficient scale-out solutions for AHC are becoming increasingly important. In this paper, we propose a Partition-based Distributed Agglomerative Hierarchical Clustering (PACk) algorithm using novel distance-based partitioning and distance-aware merging algorithms. We develop an efficient implementation on Spark that can cluster over 250 million records in 40 minutes using only 16 commodity VMs. Compared to the state-of-the-art distributed AHC algorithm, PACk achieves 2× to 19× speedup across a variety of synthetic and real-world datasets.
PVLDB is part of the VLDB Endowment Inc.
Privacy Policy