
Volume 18, No. 2

QueryArtisan: Generating Data Manipulation Codes for Ad-hoc Analysis in Data Lakes

Authors:
Xiu Tang, Wenhao Liu, Sai Wu, Chang Yao, Gongsheng Yuan, Shanshan Ying, Gang Chen

Abstract

Query processing over data lakes is a challenging task, often requiring extensive data pre-processing activities such as data cleaning, transformation, and loading. However, the advent of Large Language Models (LLMs) has illuminated a new pathway to address these complexities by offering a unified approach to understanding the diverse datasets submerged in data lakes. In this paper, we introduce QueryArtisan, a novel LLM-powered analytic tool specifically designed for data lakes. QueryArtisan transcends traditional ETL (Extract, Transform, Load) processes by generating just-in-time code for dataset-specific queries. It eliminates the need for an intermediary schema, enabling users to query the data lake directly using natural language. To achieve this, we have developed a suite of heterogeneous operators capable of processing data across various modalities. Additionally, QueryArtisan incorporates a cost model-based query optimization technique, significantly enhancing its code generation capabilities for efficient query resolution. Our extensive experimental evaluations, conducted with real-life datasets, demonstrate that QueryArtisan markedly outperforms existing solutions in terms of effectiveness, efficiency and usability.

PVLDB is part of the VLDB Endowment Inc.
