
Volume 15, No. 11

Enabling Efficient and General Subpopulation Analytics in Multidimensional Data Streams

Authors:
Antonis Manousis (Carnegie Mellon University)*, Zhuo Cheng (Peking University), Zaoxing Liu (Boston University), Ran Ben Basat (UCL), Vyas Sekar (Carnegie Mellon University)

Abstract

Many large-scale services and infrastructures (e.g., video streaming platforms, data centers, sensor grids) need diverse real-time summary statistics (e.g., cardinality, entropy, frequency moments, norms) across multiple subpopulations of multidimensional datasets. However, state-of-the-art frameworks do not offer general and accurate analytics in real time at reasonable operational cost. The root cause is the combinatorial explosion of data subpopulations, coupled with the diversity of summary statistics that must be monitored simultaneously. In this work, we present Hydra, an efficient framework for multidimensional analytics that builds on two key ideas. First, it avoids the overhead of monitoring exponentially many subpopulations with a “sketch of sketches” that summarizes data streams using space that is sub-linear in the number of data subpopulations. Second, Hydra leverages universal sketching to ensure high-fidelity estimates for a broad set of statistics, thus making the time and space complexity independent of the number of different summary statistics. We implement a prototype of Hydra as an Apache Spark plugin and evaluate it on both real-world and synthetic multidimensional datasets. We also tackle practical system challenges to keep overheads low and to operate at large scale. We show that Hydra achieves robust error bounds and is an order of magnitude more efficient in terms of operational cost and memory footprint than existing analytics engines (e.g., Spark, Druid), while ensuring interactive estimation times.
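
To make the "sketch of sketches" idea concrete, the following is a minimal illustrative sketch in Python; it is not Hydra's implementation. It assumes a fixed grid of small inner sketches, hashes each subpopulation key to one cell per row, and answers a query by combining the cells that key maps to. For brevity the inner structure is a plain Count-Min sketch rather than a universal sketch, so it only estimates item frequencies. All names (`bucket`, `CountMin`, `SketchOfSketches`) and the example records are hypothetical, and the update loop enumerates every subset of dimensions per record, which a real system would likely need to optimize.

```python
import hashlib
from itertools import combinations


def bucket(value: str, seed: int, modulus: int) -> int:
    """Deterministic seeded hash of a string into [0, modulus)."""
    digest = hashlib.blake2b(
        value.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little") % modulus


class CountMin:
    """Small Count-Min sketch; stands in for the inner (universal) sketch."""

    def __init__(self, rows: int = 3, cols: int = 64) -> None:
        self.rows, self.cols = rows, cols
        self.table = [[0] * cols for _ in range(rows)]

    def add(self, item: str, count: int = 1) -> None:
        for r in range(self.rows):
            self.table[r][bucket(item, 1000 + r, self.cols)] += count

    def estimate(self, item: str) -> int:
        # Count-Min never underestimates: take the least-contaminated row.
        return min(
            self.table[r][bucket(item, 1000 + r, self.cols)]
            for r in range(self.rows)
        )


class SketchOfSketches:
    """Fixed rows x cols grid of inner sketches shared by all subpopulations."""

    def __init__(self, rows: int = 3, cols: int = 32) -> None:
        self.rows, self.cols = rows, cols
        self.grid = [[CountMin() for _ in range(cols)] for _ in range(rows)]

    @staticmethod
    def key(subpop: dict) -> str:
        # Canonical string for a subpopulation, e.g. "city=NYC|isp=A".
        return "|".join(f"{d}={v}" for d, v in sorted(subpop.items()))

    def update(self, dims: dict, item: str) -> None:
        # A record belongs to every subpopulation formed by a non-empty
        # subset of its dimension values (assumption made for illustration).
        names = sorted(dims)
        for size in range(1, len(names) + 1):
            for subset in combinations(names, size):
                k = self.key({d: dims[d] for d in subset})
                for r in range(self.rows):
                    self.grid[r][bucket(k, r, self.cols)].add(item)

    def estimate(self, subpop: dict, item: str) -> int:
        # Colliding subpopulations can only inflate a cell, so use the minimum.
        k = self.key(subpop)
        return min(
            self.grid[r][bucket(k, r, self.cols)].estimate(item)
            for r in range(self.rows)
        )


# Hypothetical records: (dimension values, observed metric value).
records = [
    ({"city": "NYC", "isp": "A"}, "720p"),
    ({"city": "NYC", "isp": "B"}, "720p"),
    ({"city": "SF", "isp": "A"}, "1080p"),
]
sos = SketchOfSketches()
for dims, item in records:
    sos.update(dims, item)
nyc_720p = sos.estimate({"city": "NYC"}, "720p")
```

The grid size is fixed up front, so memory does not grow with the number of distinct subpopulations; the cost is that subpopulations sharing a cell can inflate each other's estimates.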

PVLDB is part of the VLDB Endowment Inc.
