Volume 14, No. 8
RPT: Relational Pre-trained Transformer Is Almost All You Need towards Democratizing Data Preparation
Abstract
Can AI help automate human-easy but computer-hard data preparation tasks that currently rely heavily on data scientists, practitioners, and crowd workers? We answer this question by presenting RPT, a denoising autoencoder for tuple-to-X models ("X" could be tuple, token, label, JSON, and so on). RPT is pre-trained for a tuple-to-tuple model by corrupting the input tuple and then learning a model to reconstruct the original tuple. It adopts a Transformer-based neural translation architecture that consists of a bidirectional encoder (similar to BERT) and a left-to-right autoregressive decoder (similar to GPT), a generalization of both BERT and GPT. The pre-trained RPT can already support several common data preparation tasks such as data cleaning, auto-completion, and schema matching. Better still, RPT can be fine-tuned on a wide range of data preparation tasks, such as value normalization, data transformation, data annotation, etc. Beyond RPT, we also discuss several appealing techniques for data preparation, e.g., collaborative training and few-shot learning for entity resolution, and few-shot learning and NLP question-answering for information extraction. In addition, we identify activities that will unleash a series of research opportunities to advance the field of data preparation.
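The corrupt-then-reconstruct pre-training objective described above can be illustrated with a minimal sketch. It uses a BART-style encoder-decoder from Hugging Face Transformers as a stand-in for RPT; the tuple serialization tokens, the masking rate, and the example record are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch (not the authors' code): a tuple-to-tuple denoising
# objective in the spirit of RPT, using a BART-style encoder-decoder
# (bidirectional encoder + autoregressive decoder) as a stand-in.
import random
import torch
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

def serialize(row):
    # Linearize a relational tuple as attribute/value pairs; the "[A]"/"[V]"
    # markers are an illustrative serialization choice, not the paper's.
    return " ".join(f"[A] {a} [V] {v}" for a, v in row.items())

def corrupt(row, p=0.3):
    # Corrupt the input tuple by masking attribute values at random.
    return {a: (tokenizer.mask_token if random.random() < p else v)
            for a, v in row.items()}

row = {"name": "Michael Stonebraker", "affiliation": "MIT", "city": "Cambridge"}

src = tokenizer(serialize(corrupt(row)), return_tensors="pt")
tgt = tokenizer(serialize(row), return_tensors="pt")

# Seq2seq denoising objective: reconstruct the original tuple from the
# corrupted one; outputs.loss is the token-level cross-entropy.
outputs = model(input_ids=src.input_ids,
                attention_mask=src.attention_mask,
                labels=tgt.input_ids)
outputs.loss.backward()
```

At inference time, the same model can fill in masked attribute values (e.g., for data cleaning or auto-completion) by generating the reconstructed tuple from a corrupted input.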