DataLoom: Simplifying Data Loading with LLMs

Authors:

Alexander Van Renen, Mihail Stoian, Andreas Kipf

Abstract

Schema discovery and data loading are crucial steps in any data analysis pipeline. While these used to be rare tasks, in the highly dynamic fields of machine learning and modern business intelligence on top of data lakes they have become a frequent, but often underestimated, activity. Existing tools often focus on single files, presume that the user already knows the data, or require a significant amount of manual labor. In this paper, we improve the process of mapping a “chaotic” set of files to an initial database schema that can then be iteratively refined and loaded. The idea is to automate the previously tedious parts of this process with Large Language Models (LLMs) while leaving already well-understood problems, such as constraint discovery, to existing algorithms. We thus carefully orchestrate the use of LLMs for the “soft” problems and traditional algorithms for the “hard” problems. This creates a more seamless schema discovery and data loading experience that minimizes the time to first insight for users. We demonstrate this vision of modern schema discovery and data loading in DataLoom, our web-based prototype.

PVLDB is part of the VLDB Endowment Inc.

Volume 17, No. 12
