Cephalopod – Virtual Data Model Composition through Partial Query Translation
Abstract
Most applications have an ideal data model: business data is best served by relations, social networks by graphs, messaging applications by documents, and machine learning by vectors. Unfortunately, many applications must be implemented against a “less-than-ideal” data model (we use the term “imposed”): business data is stored in documents, learned models must process relational tuples, and graphs are embedded in vectors. The textbook solution to this problem is physical integration: extracting, transforming, and loading (ETL) data from the imposed into the ideal data model. While effective, this ETL process is expensive and leads to staleness. Virtual integration (through query rewriting) avoids these problems but leads to a combinatorial explosion of ideal-to-imposed-model mappings. We propose to address this problem by developing a “Bridge Representation” that can be used to implement virtual integration through query translation when possible and physical integration through data transformation when necessary. In this paper, we outline the idea, study several guiding use cases, and develop a research agenda towards such a Bridge Representation and a system that implements the approach. We also provide preliminary results indicating that even non-bijective data-model integrations, such as vector embeddings, can be supported at a fraction of the cost of physical integration.