ConnectorX: Accelerating Data Loading From Databases to Dataframes

Authors:

Xiaoying Wang (Simon Fraser University)
Weiyuan Wu (Simon Fraser University)
Jinze Wu (Simon Fraser University)
Yizhou Chen (Simon Fraser University)
Nick Zrymiak (Simon Fraser University)
Changbo Qu (Simon Fraser University)
Lampros Flokas (Columbia University)
George Chow (Simon Fraser University)
Jiannan Wang (Simon Fraser University)*
Tianzheng Wang (Simon Fraser University)
Eugene Wu (Columbia University)
Qingqing Zhou (Tencent Inc.)

Abstract

Data is often stored in a database management system (DBMS), but data scientists widely use dataframe libraries. An important but challenging problem is how to bridge the gap between databases and dataframes. To solve this problem, we present ConnectorX, a client library that enables fast and memory-efficient data loading from various databases (e.g., PostgreSQL, MySQL, SQLite, SQL Server, Oracle) to different dataframes (e.g., Pandas, PyArrow, Modin, Dask, and Polars). We first investigate why the loading process is slow and why it consumes so much memory. Surprisingly, we find that the main overhead comes from the client side rather than from query execution and data transfer. We integrate several existing and new techniques to reduce this overhead, and we carefully design the system architecture and interface so that ConnectorX is easy to extend to additional databases and dataframes. Moreover, we propose server-side result partitioning, which DBMSs can adopt to better support exporting data to data science tools. We conduct extensive experiments to evaluate ConnectorX and compare it with popular libraries. The results show that ConnectorX significantly outperforms existing solutions. ConnectorX is open source at: https://github.com/sfu-db/connector-x.
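To make the idea of partitioned loading concrete, the following is a minimal sketch using only the Python standard library. It is not ConnectorX's implementation or API: the table `items`, the helper names, and the choice to split a query into key ranges that are fetched over separate connections are all assumptions made for illustration. It shows one plausible way a client could load a result in parallel pieces and then combine them.

```python
import os
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor


def make_demo_db(path, n_rows=1000):
    """Create a small SQLite table to load from (hypothetical demo data)."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, price REAL)")
    conn.executemany(
        "INSERT INTO items VALUES (?, ?)",
        ((i, i * 0.5) for i in range(1, n_rows + 1)),
    )
    conn.commit()
    conn.close()


def partition_bounds(lo, hi, num_parts):
    """Split the inclusive range [lo, hi] into at most num_parts contiguous ranges."""
    step = (hi - lo + 1 + num_parts - 1) // num_parts  # ceiling division
    return [(s, min(s + step - 1, hi)) for s in range(lo, hi + 1, step)]


def load_partition(path, bounds):
    """Fetch one key range over its own connection."""
    lo, hi = bounds
    conn = sqlite3.connect(path)
    try:
        return conn.execute(
            "SELECT id, price FROM items WHERE id BETWEEN ? AND ? ORDER BY id",
            (lo, hi),
        ).fetchall()
    finally:
        conn.close()


def load_partitioned(path, num_parts=4):
    """Load the whole (non-empty) table as key-range partitions fetched in parallel."""
    conn = sqlite3.connect(path)
    lo, hi = conn.execute("SELECT MIN(id), MAX(id) FROM items").fetchone()
    conn.close()
    bounds = partition_bounds(lo, hi, num_parts)
    with ThreadPoolExecutor(max_workers=num_parts) as pool:
        # pool.map preserves the order of bounds, so rows stay sorted by id.
        parts = list(pool.map(lambda b: load_partition(path, b), bounds))
    return [row for part in parts for row in part]


db_path = os.path.join(tempfile.mkdtemp(), "demo.db")
make_demo_db(db_path)
rows = load_partitioned(db_path, num_parts=4)
print(len(rows))
```

In this sketch the client decides the partition boundaries and issues one range query per partition. The server-side result partitioning proposed in the abstract would move that responsibility into the DBMS; the sketch does not attempt to model it.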

PVLDB is part of the VLDB Endowment Inc.

Volume 15, No. 11