Symphony: Towards Natural Language Query Answering over Multi-modal Data Lakes
Abstract
Wouldn’t it be great if we could query large, diverse data lakes of tables, text, and databases as easily as using Siri or Alexa? The problem is hard from two perspectives. First, integrating data lakes requires data normalization/transformation, schema matching, and entity resolution, and is notoriously hard, with high human cost. Second, even if integration succeeds, such efforts typically do not support arbitrary SQL queries over the integrated data set. In this paper, we propose Symphony, a novel system that enables users to easily query complex, multi-modal data lakes without performing upfront integration. For ease of use, Symphony adopts a natural language (NL) interface. To avoid integration, it employs a unified representation for multi-modal datasets, learned via cross-modal representation learning. When a user poses an NL query, Symphony discovers which tables or textual data should be retrieved based on the learned cross-modal representations, decomposes a complicated NL query into NL sub-queries on demand, evaluates each sub-query on a single data source, and combines the results of these sub-queries. A preliminary evaluation shows that the resulting system can effectively answer questions over tables and text extracted from Wikipedia.
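To make the retrieve–decompose–evaluate–combine flow concrete, the following is a minimal Python sketch of the pipeline described above. All names here (`Source`, `embed`, `decompose`, `evaluate`, `combine`) are illustrative assumptions for exposition, not Symphony's actual API; the learned components are passed in as opaque callables.

```python
# Hypothetical sketch of the query pipeline described in the abstract.
# All function and class names are illustrative assumptions, not the
# system's actual API.

from dataclasses import dataclass


@dataclass
class Source:
    """A data-lake item (a table or a text passage) with its learned embedding."""
    name: str
    modality: str           # "table" or "text"
    embedding: list[float]  # cross-modal representation shared by both modalities


def cosine(u: list[float], v: list[float]) -> float:
    """Similarity of a query and a source in the shared embedding space."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0


def answer(nl_query: str, lake: list[Source], embed, decompose, evaluate, combine):
    """Retrieve relevant sources, split the query, run sub-queries, merge results.

    `embed`, `decompose`, `evaluate`, and `combine` stand in for the learned
    components (encoder, query decomposer, per-source executor, aggregator).
    """
    q = embed(nl_query)
    # 1. Discovery: rank tables and text by similarity in the shared space.
    relevant = sorted(lake, key=lambda s: cosine(q, s.embedding), reverse=True)[:5]
    # 2. On-demand decomposition: one NL sub-query per selected source.
    sub_queries = decompose(nl_query, relevant)
    # 3. Evaluate each sub-query on exactly one data source.
    partials = [evaluate(sq, src) for sq, src in sub_queries]
    # 4. Combine the partial results into the final answer.
    return combine(nl_query, partials)
```

Because the same embedding space covers both tables and text, discovery needs no modality-specific index: a single nearest-neighbor search over the lake suffices, which is the point of avoiding upfront integration.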