Origami: A High-Performance Mergesort Framework

Authors:

Arif Arman (Texas A&ampM University)* Dmitri Loguinov (Texas A&ampM University)

Download PDF

Abstract

Mergesort is a popular kernel for real-world workloads as it is immune to data skewness, supports streaming operations, and is relatively less complex to parallelize. We introduce Origami, a mergesort framework that presents a four-phase pipeline and develops optimized solutions for each phase that are generalizable to scalar and all SIMD CPU architectures. Within the same extension set, we present the fastest in-register method for sorting small sequences and a branchless merger of sorted lists that achieves upto $1.5\times$ speedup over the naive merge. In addition, we introduce a cache residing quad merge tree to avoid bottlenecking on memory bandwidth and a parallel partitioning scheme to maximize thread level parallelism. We develop the end-to-end sort with these components and produce a highly utilized pipeline by reducing synchronization overheads between phases. Origami single thread mergesort performs upto $2\times$ faster than the closest competitor and achieves a nearly perfect speedup in multi-core environments.

PVLDB is part of the VLDB Endowment Inc.

Start

Current Submission

All Volumes

Reproducibility

General Information

Volume 15, No. 2

Origami: A High-Performance Mergesort Framework

Abstract