
Volume 18, No. 3

GraphAr: An Efficient Storage Scheme for Graph Data in Data Lakes

Authors:
Xue Li, Weibin Zeng, Zhibin Wang, Diwen Zhu, Jingbo Xu, Wenyuan Yu, Jingren Zhou

Abstract

Data lakes, increasingly adopted for their ability to store and analyze diverse types of data, commonly use columnar storage formats like Parquet and ORC for handling relational tables. However, these traditional setups fall short when it comes to efficiently managing graph data, particularly those conforming to the Labeled Property Graph (LPG) model. To address this gap, this paper introduces GraphAr, a specialized storage scheme designed to enhance existing data lakes for efficient graph data management. Leveraging the strengths of Parquet, GraphAr captures LPG semantics precisely and facilitates graph-specific operations such as neighbor retrieval and label filtering. Through innovative data organization, encoding, and decoding techniques, GraphAr dramatically improves performance. Our evaluations reveal that GraphAr outperforms conventional Parquet and Acero-based methods, achieving an average speedup of 4452× for neighbor retrieval, 14.8× for label filtering, and 29.5× for end-to-end workloads. These findings highlight GraphAr's potential to extend the utility of data lakes by enabling efficient graph data management.

PVLDB is part of the VLDB Endowment Inc.
