Volume 17, No. 12
Petabyte-Scale Row-Level Operations in Data Lakehouses
Abstract
Data lakehouses combine the near-infinite scale and diverse tooling of a data lake with the reliability and functionality of a data warehouse. This paper presents extensions to data lakehouses built on Apache Iceberg and Apache Spark that enable performant petabyte-scale row-level operations. The framework handles both high-density and sparse modifications, either by materializing changes at the file level during writes or by producing equality and position deletes that are lazily merged with existing data during reads. The paper also outlines essential improvements in determining and applying row-level changes: eliminating expensive shuffles with storage-partitioned joins, minimizing write amplification with runtime filtering, and optimizing the layout of output data with adaptive writes. Our evaluation demonstrates the relative strengths and weaknesses of the materialization strategies, highlighting the use cases each technique serves best, and shows an order-of-magnitude performance improvement from our enhancements.
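The two materialization strategies the abstract contrasts can be illustrated with a small conceptual sketch: under copy-on-write, changes are materialized by rewriting affected data files; under merge-on-read, data files are left untouched and delete records (by row position or by column equality) are merged with the data lazily at scan time. The code below is a hedged, simplified model of the merge-on-read path only; the file layout, function names, and row representation are illustrative assumptions, not Iceberg's actual implementation.

```python
# Conceptual sketch of merge-on-read (illustrative only, not Iceberg's code):
# data files stay immutable; deletes are written separately and applied
# while reading.

# A "data file" modeled as an ordered list of rows.
data_file = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b"},
    {"id": 3, "name": "c"},
    {"id": 4, "name": "d"},
]

# Position delete: marks a row dead by (file name, row ordinal).
# Cheap to apply, but the writer must know where the row lives.
position_deletes = {("data_file", 1)}       # marks the row with id=2

# Equality delete: marks rows dead by column values.
# The writer never scans the data; matching is deferred to read time.
equality_deletes = [{"id": 4}]

def read(file_name, rows, pos_deletes, eq_deletes):
    """Scan a data file, lazily filtering out deleted rows."""
    for ordinal, row in enumerate(rows):
        if (file_name, ordinal) in pos_deletes:
            continue                        # removed by position delete
        if any(all(row.get(k) == v for k, v in d.items())
               for d in eq_deletes):
            continue                        # removed by equality delete
        yield row

live = list(read("data_file", data_file, position_deletes, equality_deletes))
# live now holds the rows with id 1 and id 3
```

This sketch also hints at the trade-off the paper evaluates: sparse modifications favor deletes merged at read time (little data is rewritten), while high-density modifications favor rewriting files outright, since the cost of merging many deletes on every read would dominate.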