go back

Volume 15, No. 6

PACk: An Efficient Partition-based Distributed Agglomerative Hierarchical Clustering Algorithm for Deduplication

Authors:
Yue Wang (Microsoft)* Vivek Narasayya (Microsoft) Yeye He (Microsoft Research) Surajit Chaudhuri (Microsoft)

Abstract

The Agglomerative Hierarchical Clustering (AHC) algorithm is widely used in numerous real-world applications. As data volumes continue to grow, efficient scale-out solutions for AHC are becoming increasingly important. In this paper, we propose a Partition-based Distributed Agglomerative Hierarchical Clustering (PACk) algorithm using novel distance-based partitioning and distance-aware merging algorithms. We develop an efficient implementation on Spark that can cluster over 250 million records in 40 minutes using only 16 commodity VMs. Compared to the state-of-the-art distributed AHC algorithm, PACk achieves 2× to 19× speedup across a variety of synthetic and real-world datasets.

PVLDB is part of the VLDB Endowment Inc.

Privacy Policy