CIDR Proceedings

Bootleg: Chasing the Tail with Self-Supervised Named Entity Disambiguation

Authors:

Laurel Orr, Megan Leszczynski, Neel Guha, Sen Wu, Simran Arora, Xiao Ling, Christopher Ré

Abstract

Named Entity Disambiguation (NED) is the task of mapping textual mentions to entities in a database. A key challenge in NED is generalizing to rarely seen entities, termed tail entities. Traditional NED systems use hand-tuned features to improve tail generalization, but these features make the system challenging to deploy and maintain, especially in multiple locales. In 2018, a subset of the authors built a self-supervised NED system at Apple, which improved performance over its hand-tuned predecessor on a suite of downstream products. Motivated to understand the core reasons for this improvement, we introduce Bootleg, a clean-slate, open-source, self-supervised NED system. We first demonstrate that Bootleg matches or exceeds state-of-the-art performance on three NED benchmarks by up to 5.8 F1 points. Importantly, Bootleg improves performance over a BERT-based NED baseline by 41.2 F1 points on tail entities in Wikipedia using a simple transformer-based architecture and a hierarchical regularization scheme. Finally, we observe that embeddings from self-supervised models like Bootleg are increasingly being served to downstream applications, creating an embedding ecosystem. We initiate the study of the data management challenges associated with this ecosystem.