Volume 15, No. 3
MT-Teql: Evaluating and Augmenting Neural NLIDB on Real-world Linguistic and Schema Variations
Abstract
Natural Language Interfaces to Databases (NLIDB) translate human utterances into SQL queries, enabling non-expert users to perform data analytics. Recently, neural network models have become a major approach to implementing NLIDB. However, neural NLIDB faces considerable practical challenges due to variations in natural language and database schema design: a single user intent or database conceptual model can be expressed in multiple forms. Existing benchmarks, which evaluate accuracy on held-out datasets, cannot expose these issues well. To date, we lack a thorough understanding of how well neural NLIDB really performs in real-world situations and how robust it is against linguistic and schema variations. A key difficulty is annotating ground-truth SQL queries that correspond to real-world language and schema variations, which often requires considerable manual effort and expert knowledge. To systematically assess the robustness of de facto neural NLIDB without extensive manual effort, we propose MT-Teql, a unified framework for benchmarking NLIDB on real-world language and schema variations. Inspired by recent advances in DBMS metamorphic testing, MT-Teql delivers a model-agnostic framework that implements a comprehensive set of metamorphic relations (MRs) to perform semantics-preserving transformations on utterances and database schemas and generate their variants. This way, NLIDB models can be automatically assessed using utterances/schemas and their variants to determine robustness without any manual effort. We comprehensively benchmarked nine de facto neural NLIDB models using 62,430 test inputs in total. MT-Teql successfully identified 15,433 defects. We categorize the errors exposed by MT-Teql and analyze potential root causes of the inconsistent robustness among different neural NLIDB models. We further conducted a user study to show how MT-Teql can assist developers in systematically assessing NLIDBs.
Furthermore, we show that the transformed (error-triggering) inputs can be leveraged to augment neural NLIDB and boost its robustness. With input variants synthesized by MT-Teql, we successfully eliminate 46.5% (±5.0%) of errors for popular neural NLIDB models without compromising accuracy on standard benchmarks. In addition to the nine neural NLIDB models, we illustrate the generalizability of MT-Teql by benchmarking a popular non-neural NLIDB, NaLIR. Finally, we discuss lessons learned from the study that can help users better select and design neural NLIDB systems that fit their particular usage scenarios.
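To make the metamorphic-testing idea concrete, the following is a minimal sketch of what one such metamorphic relation (MR) might look like. This is an illustrative example in the spirit of the paper, not MT-Teql's actual implementation: `translate`, `permute_columns`, and `check_mr` are hypothetical names, and the "model" is a toy stand-in. The MR asserts that shuffling the column order of a schema is semantics-preserving, so a robust NLIDB model should produce an equivalent SQL query for the original schema and its variant; a mismatch exposes a defect without needing any manually annotated ground truth.

```python
# Illustrative sketch of a schema-transformation MR (hypothetical API,
# not MT-Teql's real code): column order carries no semantics, so the
# model's SQL output should not change when columns are permuted.
import random


def permute_columns(schema, seed=0):
    """Return a schema variant with each table's columns shuffled."""
    rng = random.Random(seed)
    variant = {}
    for table, columns in schema.items():
        shuffled = list(columns)
        rng.shuffle(shuffled)
        variant[table] = shuffled
    return variant


def normalize(sql):
    """Crude canonicalization so trivially different spellings compare equal."""
    return " ".join(sql.lower().split())


def check_mr(translate, utterance, schema):
    """Return True if the model is consistent under the MR, False if a defect."""
    original_sql = translate(utterance, schema)
    variant_sql = translate(utterance, permute_columns(schema))
    return normalize(original_sql) == normalize(variant_sql)


# Toy "model" that ignores column order, so it satisfies this MR.
def toy_model(utterance, schema):
    return "SELECT name FROM singer"


schema = {"singer": ["id", "name", "age"]}
print(check_mr(toy_model, "List all singer names", schema))  # True
```

In the actual framework, the output comparison would use SQL semantic equivalence rather than string normalization, and a full suite of MRs would cover both utterance-side transformations (e.g. paraphrasing) and schema-side ones (e.g. renaming or restructuring tables).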