Migrating a Privacy-Safe Information Extraction System to a Software 2.0 Design
Abstract
This paper presents a case study of migrating a privacy-safe information extraction system in production for Gmail from a traditional rule-based architecture to a machine-learned Software 2.0 architecture. The key idea is to use the extractions from the existing rule-based system as training data to learn models that in turn replace all the machinery of the rule-based system. The resulting system a) delivers better precision and recall, b) is significantly smaller in terms of lines of code, c) is easier to maintain and improve, and d) has allowed us to leverage machine learning advances to build a cross-language extraction system even though our original training data was only in English. We describe the challenges we encountered during this migration in generating and managing training data and in evaluating models, and we report on the many traditional “Software 1.0” components we built to address them.
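To make the key idea concrete, the following is a minimal sketch, not the production system: a hand-written rule labels tokens in a few example messages, and those labels are the only supervision used to train a small model that can then run without the rule. Everything in it is an assumption made for illustration, including the order-number rule, the toy emails, the two token features, and the choice of a perceptron as the learner.

```python
import re
from collections import defaultdict

ORDER_NUMBER = re.compile(r"#\d+")

def rule_labels(tokens):
    """Hypothetical Software 1.0 rule: '#<digits>' directly after the word 'order'."""
    return [
        int(i > 0 and tokens[i - 1].lower() == "order"
            and ORDER_NUMBER.fullmatch(tok) is not None)
        for i, tok in enumerate(tokens)
    ]

def features(tokens, i):
    """Illustrative features: the previous token and whether the token has a digit."""
    prev = tokens[i - 1].lower() if i > 0 else "<s>"
    has_digit = any(c.isdigit() for c in tokens[i])
    return [f"prev={prev}", f"has_digit={has_digit}", "bias"]

def train_perceptron(examples, max_epochs=100):
    """Train a binary perceptron; stop once an epoch makes no mistakes."""
    weights = defaultdict(float)
    for _ in range(max_epochs):
        mistakes = 0
        for feats, label in examples:
            pred = int(sum(weights[f] for f in feats) > 0)
            if pred != label:
                mistakes += 1
                for f in feats:
                    weights[f] += label - pred
        if mistakes == 0:
            break
    return weights

def predict(weights, tokens):
    """Label each token with the learned model; the rule is not consulted."""
    return [
        int(sum(weights.get(f, 0.0) for f in features(tokens, i)) > 0)
        for i in range(len(tokens))
    ]

# Toy stand-ins for messages; the rule's output is the only source of labels.
TRAIN_EMAILS = [
    "Thanks! Order #4821 ships Monday",
    "Order #77 has 3 items",
    "Order confirmed for 2 tickets",
    "Your order #905 arrived",
]

examples = []
for email in TRAIN_EMAILS:
    tokens = email.split()
    for i, label in enumerate(rule_labels(tokens)):
        examples.append((features(tokens, i), label))

weights = train_perceptron(examples)
```

After training, `predict(weights, "Reminder: order #31 is ready".split())` labels tokens using only the learned weights. In this toy setting the model can at best reproduce the rule it was trained on; any gains in precision, recall, or cross-language coverage depend on the richer models and training data management that the abstract refers to.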