Approximating Median Absolute Deviation with Bounded Error

Authors:

Zhiwei Chen (Tsinghua University), Shaoxu Song (Tsinghua University), Ziheng Wei (Huawei Technologies Co., Ltd.), Jingyun Fang (Huawei Technologies Co., Ltd.), Jiang Long (Huawei Technologies Co., Ltd.)

Download PDF

Abstract

The median absolute deviation (MAD) is a statistic measuring the variability of a set of quantitative elements. It is known to be more robust to outliers than the standard deviation (SD), and thereby widely used in outlier detection. Computing the exact MAD however is costly, e.g., by calling an algorithm of finding median twice, with space cost O(n) over n elements in a set. In this paper, we propose the first fully mergeable approximate MAD algorithm, OP-MAD, with one-pass scan of the data. Remarkably, by calling the proposed algorithm at most twice, namely TP-MAD, it guarantees to return an (e, 1)-accurate MAD, i.e., the error relative to the exact MAD is bounded by the desired e or 1. The space complexity is reduced to O(m) while the time complexity is O(n+mlogm), where m is the size of the sketch used to compress data, related to the desired error bound e. To get a more accurate MAD, i.e., with smaller e, the sketch size m will be larger, a trade-off between effectiveness and efficiency. In practice, we often have the sketch size m ≪ n, leading to constant space cost O(1) and linear time cost O(n). The extensive experiments over various datasets demonstrate the superiority of our solution, e.g., 160000× less memory and 18× faster than the aforesaid exact method in datasets pareto and norm. Finally, we further implement and evaluate the parallelizable TP-MAD in Apache Spark, and the fully mergeable OP-MAD in Structured Streaming.

PVLDB is part of the VLDB Endowment Inc.

Start

Current Submission

All Volumes

Reproducibility

General Information

Volume 14, No. 11

Approximating Median Absolute Deviation with Bounded Error

Abstract