Sypse: Privacy-first Data Management through Pseudonymization and Partitioning
Abstract
Data privacy and the ethical, responsible use of personal information are becoming increasingly important, in part because of new regulations such as GDPR and CCPA, and in part because of growing public distrust in how companies handle personal data. However, operationalizing privacy-by-design principles is difficult, especially given that current data management systems are designed primarily to make it easier and more efficient to store, process, access, and share vast amounts of data. In this paper, we present a vision for transparently rearchitecting database systems by combining pseudonymization, synthetic data, and data partitioning to achieve three privacy goals: (1) reduce the impact of breaches by separating detailed personal information from personally identifying information (PII) and scrambling it, (2) make it easy to comply with a deletion request (“right to be forgotten”) through overwrites of portions of the data, and (3) reduce the need for developers and engineers to access PII. We present a general architecture and several potential strategies for achieving these goals, along with initial experimental results comparing the performance of the different strategies. We end with a discussion of some of the major research challenges moving forward.
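To make the first two goals concrete, the following is a minimal Python sketch of one way pseudonymization and partitioning could fit together. It is not the Sypse architecture itself: the class `PseudonymizedStore`, its method names, the per-user HMAC key, and the `"REDACTED"` placeholder (standing in for synthetic data) are all assumptions made for illustration. PII lives in one partition, detailed records live in another under a keyed pseudonym, and a deletion request is served by overwriting only the small identity row.

```python
import hashlib
import hmac
import secrets


class PseudonymizedStore:
    """Hypothetical two-partition store (illustrative only, not Sypse's design)."""

    def __init__(self):
        # Identity partition: user_id -> {"pii": dict, "key": bytes}.
        self.identity = {}
        # Detail partition: pseudonym -> list of records, holding no PII.
        self.details = {}

    def _pseudonym(self, user_id):
        # Derive the pseudonym from a per-user secret key using HMAC-SHA256.
        key = self.identity[user_id]["key"]
        return hmac.new(key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()

    def add_user(self, user_id, pii):
        self.identity[user_id] = {"pii": dict(pii), "key": secrets.token_bytes(32)}

    def add_record(self, user_id, record):
        self.details.setdefault(self._pseudonym(user_id), []).append(dict(record))

    def records_for(self, user_id):
        return list(self.details.get(self._pseudonym(user_id), []))

    def forget(self, user_id):
        # Overwrite the PII and the key in place. The detail records stay
        # where they are, but nothing can re-derive their pseudonym.
        entry = self.identity[user_id]
        entry["pii"] = {field: "REDACTED" for field in entry["pii"]}
        entry["key"] = secrets.token_bytes(32)
```

In this sketch, a breach of the detail partition alone exposes records keyed only by opaque pseudonyms, and `forget` touches a single identity row rather than every record that belongs to the user.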