Towards an Objective Metric for Data Value Through Relevance
Abstract
The rate at which humanity produces data has increased significantly over the last decade. As organizations generate unprecedented amounts of data, storing, cleaning, integrating, and analyzing this data consumes substantial human and computational resources. At the same time, organizations extract significant value from their data. In this work, we present our vision for developing an objective metric for the value of data based on the recently introduced concept of data relevance. We outline proposals for how to efficiently compute and maintain such metrics and for how to utilize data value to improve data management, including storage organization, query performance, intelligent allocation of data collection and curation efforts, data catalogs, and pricing decisions in data markets. While we mostly focus on tabular data, the concepts we introduce can also be applied to other data models such as semi-structured data (e.g., JSON) or property graphs. Furthermore, we discuss strategies for dealing with data and workloads that evolve, and for dealing with data that is currently not relevant but has potential value (we refer to this as dark data). Finally, we sketch ideas for measuring the value that a query or workload has for an organization and reason about the interaction between query and data value.