Topology-aware Parallel Data Processing: Models, Algorithms and Systems at Scale
Abstract
The analysis of massive datasets requires a large number of processors. Prior research has largely assumed that tracking the actual data distribution and the underlying network structure of a cluster, which we collectively refer to as the topology, comes with a high cost and has little practical benefit. As a result, theoretical models, algorithms and systems often assume a uniform topology; however this assumption rarely holds in practice. This necessitates an end-to-end investigation of how one can model, design and deploy topology-aware algorithms for fundamental data processing tasks at large scale. To achieve this goal, we first develop a theoretical parallel model that can jointly capture the cost of computation and communication. Using this model, we explore algorithms with theoretical guarantees for three basic tasks: aggregation, join, and sorting. Finally, we consider the practical aspects of implementing topology-aware algorithms at scale, and show that they have the potential to be orders of magnitude faster than their topology-oblivious counterparts.