Representative Time Series Discovery for Data Exploration

Authors:

Ge Lee, Shixun Huang, Zhifeng Bao, Yanchang Zhao

Abstract

In this work, we address the critical task of discovering representative time series in exploratory data mining. We define a representative time series, referred to as a similarity-bounded representative time series, as one that represents other time series if their similarity meets a user-defined threshold. Building on this definition, we study the problem of finding the smallest set of such time series that can represent a specified proportion of all time series within the dataset. The representativeness of each similarity-bounded representative time series is controllable and determined by the specified level of similarity, and only the minimum number of such representatives needed to collectively represent the specified proportion of the entire set is identified. Identifying representative time series over large-scale data in an efficient and effective manner facilitates exploratory data analysis and summary generation, serving a wide range of data exploration applications across diverse domains. We first prove the NP-hardness of this problem and propose a range of approximation methods with theoretical guarantees, which we refer to as non-learning-based methods. While effective, these methods often excel in either running time or memory efficiency, but not both concurrently. To overcome these limitations, we further propose a learning-based method that simultaneously optimizes both time and memory efficiency. This method leverages novel data preparation and training strategies, providing adaptability to user-specified representativeness requirements with low memory usage and computational overhead. We conduct extensive experiments across four real-world datasets to demonstrate that our learning-based method is highly competitive with non-learning-based methods in terms of effectiveness (it produces a similar number of representative time series), while achieving significantly higher efficiency (up to 21× speedups) and lower memory consumption (saving up to 101× memory space).
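The problem statement above resembles a partial set cover instance, and a textbook greedy heuristic gives a feel for it. The sketch below is illustrative only and is not one of the methods proposed in the paper. It assumes that similarity is measured by Euclidean distance between equal-length series, so "similarity meets the threshold" becomes "distance is at most `max_dist`". The names `greedy_representatives`, `max_dist`, and `coverage` are hypothetical.

```python
import math


def greedy_representatives(series, max_dist, coverage):
    """Greedy partial set cover over equal-length time series.

    Assumption (not from the paper): series i represents series j when
    their Euclidean distance is at most max_dist.
    Returns the chosen representative indices and the set of covered indices.
    """
    n = len(series)
    # covers[i] holds every index that series i can represent, itself included.
    covers = [
        {j for j in range(n) if math.dist(series[i], series[j]) <= max_dist}
        for i in range(n)
    ]
    target = math.ceil(coverage * n)  # number of series that must be represented
    covered, chosen = set(), []
    while len(covered) < target:
        # Pick the series that represents the most still-uncovered series.
        best = max(range(n), key=lambda i: len(covers[i] - covered))
        chosen.append(best)
        covered |= covers[best]
    return chosen, covered


# Toy data: two tight groups and one outlier.
toy = [[0, 0], [0, 0.1], [0.1, 0], [5, 5], [5, 5.1], [9, 9]]
reps, covered = greedy_representatives(toy, max_dist=0.5, coverage=0.8)
print(reps, sorted(covered))
```

With a coverage of 0.8 the outlier need not be represented, so two representatives suffice on the toy data; raising the coverage to 1.0 forces a third. The pairwise `covers` table takes quadratic time and memory, which hints at the kind of time and memory trade-off the abstract describes for large-scale data.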

PVLDB is part of the VLDB Endowment Inc.

Volume 18, No. 3
