go back
go back
Volume 17, No. 12
Towards Resource Efficiency: Practical Insights into Large-Scale Spark Workloads at ByteDance
Abstract
At ByteDance, where we execute over a million Spark jobs and handle 500PB of shuffled data daily, ensuring resource efficiency is paramount for cost savings. However, achieving optimization of resource efficiency in large-scale production environments poses significant challenges. Drawing from our practical experiences, we have identified three key issues critical to addressing resource efficiency in real-world production settings: ① slow I/Os leading to excessive CPU and memory idleness, ② coarse-grained resource control causing wastage, and ③ sub-optimal job configurations resulting in low utilization. To tackle these issues, we propose a resource efficiency governance framework for Spark workloads. Specifically, ① we devise the multi-mechanism shuffle services, including Enhanced External Shuffle Service (ESS) and Cloud Shuffle Service (CSS), where CSS employs a push-based approach to enhance I/O efficiency through sequential reading. ② We modify the Spark configuration parameter protocol, allowing for fine-grained resource control by introducing several new parameters such as milliCores and memoryBurst, as well as supporting operators with additional spill modes. ③ We design a two-stage configuration auto-tuning method, comprising rule-based and algorithm-based tuning, providing more reliable Spark configuration optimizations. By deploying these techniques on millions of Spark jobs in production over the last two years, we have achieved over 22% CPU utilization increase, 5% memory utilization increase, and 10% shuffle block time ratio decrease, effectively saving millions of CPU cores and petabytes of memory daily.
PVLDB is part of the VLDB Endowment Inc.
Privacy Policy