Volume 17, No. 12

LucidScript: Bottom-up Standardization for Data Preparation

Authors:
Eugenie Y. Lai, Yuze Lou, Brit Youngmann, Michael Cafarella

Abstract

Data preparation is an essential step in every data-related effort, from scientific projects in academia to data-driven decision-making in industry. Typically, data preparation is not an interesting piece of a project — it transforms raw data into a format that enables further innovative work. Because such scripts are never intended to be interesting, are project-specific, and are written in general-purpose languages, they can be tedious to understand and difficult to verify. As a result, data preparation scripts can easily become a breeding ground for poor engineering and statistical practices. Ideally, data preparation scripts are “admirably boring” — they should serve the project, but otherwise be as simple and as standard as possible. We propose a bottom-up script standardization framework that takes a user’s data preparation script and transforms it into a simpler, more standardized version of itself. Our framework takes the user’s script not as an unchangeable definition of correctness, but as a sketch of the user’s intent. We embedded this framework in a system called LucidScript.

PVLDB is part of the VLDB Endowment Inc.
