Amalur: Next-generation Data Integration in Data Lakes
Abstract
Data science workflows often require extracting, preparing, and integrating data from multiple data sources. This is a cumbersome and slow process: most of the time, data scientists prepare data in a data processing system or a data lake and export it as a table so that it can be consumed by a Machine Learning (ML) algorithm. Recent advances in the area of factorized ML allow us to push down certain linear algebra (LA) operators, executing them closer to the data sources. With this work, we revisit classic data integration (DI) systems and see how they fit into modern data lakes that are meant to support LA as a first-class citizen.
Citation
@inproceedings{cidr/2022/a85-hai,
author = {Rihan Hai and
Christos Koutras and
Andra Ionescu and
Asterios Katsifodimos},
title = {Amalur: Next-generation Data Integration in Data Lakes},
booktitle = {Proceedings of the 12th Conference on Innovative Data Systems Research, CIDR 2022},
publisher = {www.cidrdb.org},
year = {2022},
series = {CIDR 2022},
url = {https://cidr.org/temp-website/papers/2022/a85-hai.pdf},
location = {Chaminade, CA, USA}
}