Volume 15, No. 7

Selective Data Acquisition in the Wild for Model Charging

Authors:
Chengliang Chai (Tsinghua University)
Jiabin Liu (Tsinghua University)
Nan Tang (Qatar Computing Research Institute, HBKU)
Guoliang Li (Tsinghua University)*
Yuyu Luo (Tsinghua University)

Abstract

The lack of sufficient labeled data is a key bottleneck for practitioners in many real-world supervised machine learning (ML) tasks. In this paper, we study a new problem, namely selective data acquisition in the wild for model charging: given a supervised ML task and data in the wild (e.g., enterprise data warehouses, online data repositories, and data markets), the problem is to select labeled data instances from the data in the wild as additional training data that can help the ML task. The problem consists of two steps. The first step is to discover relevant datasets (e.g., tables with similar relational schemas), which yields a set of candidate datasets. Because these candidate datasets come from different sources and may follow different distributions, not all of the data instances they contain can help. The second step is to select which data instances from these candidate datasets should be used. We build an end-to-end solution to this problem. For step 1, we piggyback on off-the-shelf data discovery tools. Technically, our focus is on step 2, for which we propose a solution framework called Dataselect. It first clusters all data instances from the candidate datasets such that each cluster contains similar data instances from different sources. It then iteratively picks which cluster to use, samples data instances (i.e., a mini-batch) from the picked cluster, evaluates the mini-batch, and revises the search criteria by learning from the feedback (i.e., reward) based on the evaluation. We propose a multi-armed bandit-based solution and a Deep Q-Network-based reinforcement learning solution. Experiments on both relational and image datasets show that our methods outperform baselines for selecting data instances from candidate datasets obtained from multiple sources; the baselines include using the entire candidate datasets, selecting only similar data instances, active learning-based methods, and coresets.
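
The iterative loop described in the abstract (pick a cluster, sample a mini-batch, evaluate it, update from the reward) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say which bandit rule is used, so UCB1 is assumed here, and the names `select_with_ucb1` and `toy_reward` are hypothetical. The reward function is a stand-in; in the setting the abstract describes, the reward would come from evaluating the mini-batch against the ML task. Keeping a mini-batch only when its reward is positive is likewise an assumption made for illustration.

```python
import math
import random


def select_with_ucb1(clusters, reward_fn, rounds, batch_size, seed=0):
    """Treat each cluster as a bandit arm and pick mini-batches with UCB1.

    Returns (selected_instances, pulls_per_cluster).
    """
    rng = random.Random(seed)
    # Copy and shuffle each cluster so slicing yields a random mini-batch.
    pools = [list(c) for c in clusters]
    for pool in pools:
        rng.shuffle(pool)

    pulls = [0] * len(pools)          # how often each cluster was picked
    mean_reward = [0.0] * len(pools)  # running average reward per cluster
    selected = []

    for t in range(1, rounds + 1):
        live = [i for i, pool in enumerate(pools) if pool]
        if not live:
            break  # every cluster is exhausted
        untried = [i for i in live if pulls[i] == 0]
        if untried:
            arm = untried[0]  # try each cluster once before using UCB scores
        else:
            arm = max(
                live,
                key=lambda i: mean_reward[i]
                + math.sqrt(2.0 * math.log(t) / pulls[i]),
            )

        # Sample a mini-batch without replacement from the picked cluster.
        batch = pools[arm][:batch_size]
        del pools[arm][:batch_size]

        # Evaluate the mini-batch and update the cluster's reward estimate.
        reward = reward_fn(batch)
        pulls[arm] += 1
        mean_reward[arm] += (reward - mean_reward[arm]) / pulls[arm]

        # Assumption: keep only mini-batches with positive reward.
        if reward > 0:
            selected.extend(batch)

    return selected, pulls


def toy_reward(batch):
    # Stand-in reward: fraction of "helpful" instances (encoded as 1).
    return sum(batch) / len(batch)


# Toy data: cluster 0 holds helpful instances, cluster 1 unhelpful ones.
clusters = [[1] * 200, [0] * 200]
selected, pulls = select_with_ucb1(clusters, toy_reward, rounds=20, batch_size=5)
```

On this toy input the helpful cluster is picked more often than the unhelpful one, which is the behavior the feedback loop is meant to produce. The Deep Q-Network variant mentioned in the abstract is not sketched here.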

PVLDB is part of the VLDB Endowment Inc.
