Volume 15, No. 10
Optimizing Inference Serving on Serverless Platforms
Abstract
Serverless computing is an emerging cloud paradigm that implements a pay-per-use cost model and frees users from the burden of managing virtual resources. This makes it highly attractive for machine learning (ML) inference serving, as it provides robust, easy-to-use autonomous resource scaling, especially under bursty workloads. Existing serverless platforms work well for image-based ML inference serving, where requests are homogeneous in their service demands. However, recent advances in natural language processing cannot fully benefit from existing serverless platforms, as their requests are intrinsically heterogeneous. Batching requests can significantly increase ML serving efficiency while reducing monetary cost, thanks to the pay-per-use pricing model adopted by serverless platforms. Yet batching heterogeneous ML requests incurs additional computation overhead, as small requests must be "padded" to the same size as the largest request in the batch. Reaching effective batching decisions (i.e., which requests should be batched together, and why) is non-trivial: the padding overhead coupled with serverless auto-scaling forms a complex optimization problem. To address this, we develop Multi-Buffer Serving (MBS), a framework that optimizes the batching of heterogeneous ML inference requests to minimize their monetary cost while meeting their service level objectives (SLOs). The core of MBS is a performance and cost estimator driven by analytical models and supercharged by a Bayesian optimizer. MBS is prototyped and evaluated on AWS using bursty workloads. Experimental results show that MBS preserves SLOs while outperforming the state of the art by up to 8× in cost savings, reducing padding overhead by up to 37×, and issuing 3× fewer serverless function invocations.
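The padding overhead motivating this work can be illustrated with a small sketch (this is a hypothetical example, not the authors' MBS implementation): when heterogeneous sequence lengths share a batch, every request is padded to the longest one, so the wasted fraction of computation grows with the length spread.

```python
# Hypothetical illustration of batching heterogeneous NLP requests:
# each request is padded to the longest sequence in its batch, so the
# processor computes batch_size * max_len tokens, of which only
# sum(lengths) tokens carry real input.

def padding_overhead(lengths):
    """Fraction of computed tokens that are pure padding for one batch."""
    max_len = max(lengths)
    total = max_len * len(lengths)   # tokens actually processed
    useful = sum(lengths)            # tokens carrying real input
    return (total - useful) / total

# Batching one short request with long ones wastes much of the work,
# while grouping similarly sized requests keeps the overhead small:
mixed = [8, 120, 128]     # sequence lengths in tokens
grouped = [120, 128]      # size-aware batch without the short request

print(round(padding_overhead(mixed), 2))    # → 0.33
print(round(padding_overhead(grouped), 2))  # → 0.03
```

This is the trade-off the abstract describes: grouping by size cuts padding waste, but holding requests to form such groups interacts with auto-scaling and SLO deadlines, which is what makes the batching decision a non-trivial optimization.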