
Volume 14, No. 8

RPT: Relational Pre-trained Transformer Is Almost All You Need towards Democratizing Data Preparation

Authors:
Nan Tang (Qatar Computing Research Institute, HBKU), Ju Fan (Renmin University of China), Fangyi Li (Renmin University of China), Jianhong Tu (Renmin University of China), Xiaoyong Du (Renmin University of China), Guoliang Li (Tsinghua University), Samuel Madden (MIT), Mourad Ouzzani (Qatar Computing Research Institute, HBKU)

Abstract

Can AI help automate human-easy but computer-hard data preparation tasks that currently heavily involve data scientists, practitioners, and crowd workers? We answer this question by presenting RPT, a denoising autoencoder for tuple-to-X models (“X” could be tuple, token, label, JSON, and so on). RPT is pre-trained as a tuple-to-tuple model by corrupting the input tuple and then learning a model to reconstruct the original tuple. It adopts a Transformer-based neural translation architecture that consists of a bidirectional encoder (similar to BERT) and a left-to-right autoregressive decoder (similar to GPT), a generalization of both BERT and GPT. The pre-trained RPT can already support several common data preparation tasks such as data cleaning, auto-completion, and schema matching. Better still, RPT can be fine-tuned on a wide range of data preparation tasks, such as value normalization, data transformation, data annotation, etc. Beyond RPT, we also discuss several appealing techniques for data preparation, e.g., collaborative training and few-shot learning for entity resolution, and few-shot learning and NLP question-answering for information extraction. In addition, we identify activities that will unleash a series of research opportunities to advance the field of data preparation.
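To make the tuple-to-tuple denoising objective concrete, here is a minimal sketch in Python. It uses the open-source BART encoder-decoder from Hugging Face as a stand-in for RPT; the attribute/value serialization format, the choice of which cell to corrupt, and the example tuple are illustrative assumptions, not the authors' exact recipe.

```python
# Sketch: corrupt a relational tuple, then train a seq2seq model to
# reconstruct the original tuple (the tuple-to-tuple denoising objective).
# Uses facebook/bart-base as a stand-in for RPT; serialization markers
# (ATTR/VAL) and the masked cell are illustrative assumptions.
import torch
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

def serialize(tuple_dict):
    # Flatten a relational tuple into a token sequence the model can read.
    return " ".join(f"ATTR {a} VAL {v}" for a, v in tuple_dict.items())

row = {"name": "Michael Jordan", "affiliation": "UC Berkeley", "city": "Berkeley"}
original = serialize(row)

# Corrupt the input tuple: hide one cell value behind the mask token.
corrupted = serialize({**row, "city": tokenizer.mask_token})

# One denoising training step: reconstruct the original from the corrupted tuple.
batch = tokenizer(corrupted, return_tensors="pt")
labels = tokenizer(original, return_tensors="pt").input_ids
loss = model(input_ids=batch.input_ids,
             attention_mask=batch.attention_mask,
             labels=labels).loss
loss.backward()  # a real pre-training run would loop over many corrupted tuples

# After pre-training, the same model can fill in missing values (auto-completion).
with torch.no_grad():
    generated = model.generate(batch.input_ids, max_length=64, num_beams=4)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```

The same corrupt-and-reconstruct pattern underlies the downstream uses mentioned in the abstract: masking a dirty cell corresponds to data cleaning, masking an absent cell to auto-completion.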
