Boosting Efficiency of External Pipelines by Blurring Application Boundaries
Abstract
Modern application development addresses increasingly specialized problems using domain-specific utilities, such as Optical Character Recognition and standalone statistical tools. The diversity of tooling, combined with the ever-growing volume of data, requires data pipelines both to be efficient and to support a variety of data processing tools within the same pipeline. Existing approaches, however, impose a tradeoff between modularity and performance. On the one hand, data processing systems are specialized for fast execution of complex queries, gaining efficiency at the cost of high development effort and required domain expertise. On the other hand, highly extensible systems opt for composability at the expense of efficient execution, because they make minimal assumptions about input and output formats. This paper proposes Generalized OLAP (GOLAP), a new DBMS paradigm that makes automatic extensibility of functionality a first-class design goal. GOLAP ingests external utilities to provide the functionality of external modular data pipelines while maintaining the performance of natively optimized DBMS functions. Through a combination of runtime inspection and static analysis, GOLAP detects inter-utility communication inefficiencies and parallelization opportunities that lie beyond the reach of optimizing each utility in isolation. It then modifies the utilities to elide unnecessary inter-utility operations and parallelizes the pipeline to increase hardware utilization. To evaluate GOLAP, we build Caesar, a prototype that optimizes simple pipelines. Caesar achieves up to 22x speedup, at the cost of a limited instrumentation period during which the slowdown stays below 17%.
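As a rough illustration of what eliding an unnecessary inter-utility operation can mean, the sketch below models two utilities that exchange data as CSV text, then shows a fused variant in which the encode/decode pair is removed. This is not GOLAP or Caesar code: every function name here is hypothetical, the CSV boundary is an assumed example of an inter-utility format, and the fusion is done by hand rather than discovered through runtime inspection or static analysis.

```python
import csv
import io


def produce_rows(n):
    """Hypothetical upstream utility: returns (id, value) records."""
    return [(i, i * i) for i in range(n)]


def serialize(rows):
    """Boundary cost: encode records as CSV text for the next utility."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()


def deserialize(text):
    """Boundary cost: parse the CSV text back into typed records."""
    return [(int(a), int(b)) for a, b in csv.reader(io.StringIO(text))]


def total(rows):
    """Hypothetical downstream utility: sums the value column."""
    return sum(value for _, value in rows)


def pipeline_with_boundary(n):
    # Each utility is treated as isolated, so data crosses the boundary as text.
    return total(deserialize(serialize(produce_rows(n))))


def pipeline_fused(n):
    # The encode/decode pair cancels out, so records are passed directly.
    return total(produce_rows(n))


if __name__ == "__main__":
    # Both variants compute the same result; only the boundary work differs.
    assert pipeline_with_boundary(1000) == pipeline_fused(1000)
    print(pipeline_fused(1000))
```

The two pipelines are functionally equivalent, which is the property an automatic optimizer would have to establish before removing the boundary work.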