CIDR Proceedings

CryoDrill: Near-Data Processing in Deep and Cold Storage Hierarchies

Authors:

Marcus Paradies

Abstract

Modern high-performance, high-resolution observational instruments and complex models of the earth system and of physical, chemical, and biological processes generate hundreds of petabytes of scientific data per year, contributing significantly to the infamous data deluge. Important application domains that generate large amounts of scientific data include earth observation, high-energy physics, radio astronomy, and weather forecasting. Increasingly, digital data archives store such scientific data in private cloud infrastructures for further investigation and long-term preservation, and disseminate it through data platforms via order-based catalogs. To reduce the total cost of ownership, such data platforms employ a hierarchical storage management system with large, disk-based caches and robotic tape libraries [2]. Often, data is stored in coarse-grained, value-added data products whose granularity is determined by observational rather than data access characteristics. Fine-grained tiling of such data products is often applied only late in the data analysis workflow. Prefetching all the data from a slower storage layer in advance is typically not possible, both because scientific analysis tasks are ad hoc and because the data working set required to achieve satisfactory results for long-term trend analysis and prediction is very large. With the proliferation of scientific data for general analysis by the academic community, wasteful data transfers across storage layers and unacceptably high data access latencies will likely become a major research barrier in the near future.