Dataset Relationship Management
Abstract
The database community has largely focused on providing improved transaction management and query capabilities over records (and generalizations thereof). Yet such capabilities address only a small part of today’s data science tasks, which are often much more focused on discovery, linking, comparative analysis, and collaboration across holistic datasets and data products. Data scientists frequently point to a strong need for data management at the level of their many datasets and data products. We propose the development of the dataset relationship management system (DRMS) to support five main classes of operations on datasets: reuse of schema, data, curation, and work across many datasets; revelation of provenance, context, and assumptions; rapid revision of data and processing steps; system-assisted retargeting of computation to alternative execution environments; and metrics to reward individuals’ contributions to the broader data ecosystem. We argue that the recent adoption of computational notebooks (particularly JupyterLab and Jupyter Notebook) as a unified interface over data tools provides an ideal way of gathering detailed information about how data is being used, i.e., of transparently capturing dataset provenance and relationships. Such notebooks thus provide an attractive mechanism for integrating dataset relationship management into the data science ecosystem. We briefly outline our experiences in building towards JuNEAU, the first prototype DRMS.