Volume 14, No. 11

Deep Learning for Blocking in Entity Matching: A Design Space Exploration

Authors:
Saravanan Thirumuruganathan (Qatar Computing Research Institute), Han Li (Amazon Alexa AI), Nan Tang (Qatar Computing Research Institute, HBKU), Mourad Ouzzani (Qatar Computing Research Institute, HBKU), Yash Govind (University of Wisconsin-Madison), Derek Paulsen (University of Wisconsin-Madison), Glenn M. Fung (American Family Insurance), AnHai Doan (University of Wisconsin-Madison)

Abstract

Entity matching (EM) finds data instances that refer to the same real-world entity. Most EM solutions perform blocking and then matching. Many works have applied deep learning (DL) to matching, but far fewer have applied DL to blocking. These blocking works are also limited: they consider only a simple form of DL, and some of them require labeled training data. In this paper, we develop the DeepBlocker framework, which significantly advances the state of the art in applying DL to blocking for EM. We first define a large space of DL solutions for blocking, which contains solutions of varying complexity and subsumes most previous works. Next, we develop eight representative solutions in this space. These solutions do not require labeled training data and exploit recent advances in DL (e.g., sequence modeling, transformers, self-supervision). We empirically determine which solutions perform best on which kinds of datasets (structured, textual, or dirty). We show that the best of these eight solutions outperform both the best existing DL solution and the best existing non-DL solutions (including a state-of-the-art industrial non-DL solution) on dirty and textual data, and are comparable on structured data. Finally, we show that combining the best DL and non-DL solutions can perform even better, suggesting a new avenue for research.
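
For readers unfamiliar with the blocking step, the following is a minimal sketch of embedding-based blocking in general, not of DeepBlocker itself. It assumes each tuple is flattened to a string, and it swaps in a hashed character-trigram vector as a stand-in for a learned tuple embedding. The names `embed`, `cosine`, `block`, and `DIM` are hypothetical and do not come from the paper.

```python
import math
import zlib
from collections import Counter

DIM = 64  # hypothetical embedding size, chosen only for illustration


def embed(text, dim=DIM):
    """Hashed character-trigram vector, L2-normalised.

    This is a stand-in for a learned tuple embedding; it is NOT the
    embedding used by any solution in the paper.
    """
    padded = f"  {text.lower()}  "
    counts = Counter(
        zlib.crc32(padded[i:i + 3].encode("utf-8")) % dim
        for i in range(len(padded) - 2)
    )
    vec = [0.0] * dim
    for idx, count in counts.items():
        vec[idx] = float(count)
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(u, v):
    """Cosine similarity of two vectors that are already L2-normalised."""
    return sum(a * b for a, b in zip(u, v))


def block(table_a, table_b, k=1):
    """Return candidate pairs (i, j): for each tuple in A, its k most
    similar tuples in B. Only these pairs are passed on to matching."""
    emb_b = [embed(t) for t in table_b]
    candidates = set()
    for i, t in enumerate(table_a):
        e = embed(t)
        ranked = sorted(
            range(len(table_b)),
            key=lambda j: cosine(e, emb_b[j]),
            reverse=True,
        )
        candidates.update((i, j) for j in ranked[:k])
    return candidates


if __name__ == "__main__":
    a = ["dell xps 13 laptop", "canon eos 80d camera"]
    b = ["Dell XPS 13 notebook", "Canon EOS 80D DSLR", "sony wh-1000xm4"]
    print(sorted(block(a, b, k=1)))
```

The point of the sketch is the reduction in work: the matcher sees at most |A| x k candidate pairs instead of all |A| x |B| pairs. A real system would replace the brute-force ranking with an approximate nearest-neighbor index and the hashed trigrams with a trained model.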

PVLDB is part of the VLDB Endowment Inc.
