Towards Universal Dense Blocking for Entity Resolution
Abstract
Blocking is a critical step in entity resolution, and the emergence of neural network-based representation models has led to the development of dense blocking as a promising approach for exploring deep semantics in blocking. However, previous advanced self-supervised dense blocking approaches require domain-specific training on the target domain, which limits the benefits and rapid adaptation of these methods. To address this issue, we propose UniBlocker, a dense blocker that is pre-trained on a domain-independent, easily-obtainable tabular corpus using self-supervised contrastive learning. By conducting domain-independent pre-training, UniBlocker can be adapted to various downstream blocking scenarios without requiring domain-specific fine-tuning. To evaluate the universality of our entity blocker, we also construct a new benchmark covering a wide range of blocking tasks from multiple domains and scenarios. Our experiments show that the proposed UniBlocker, without any domain-specific learning, significantly outperforms previous self- and unsupervised dense blocking methods and is comparable and complementary to the state-of-the-art sparse blocking methods.
- Publication:
-
arXiv e-prints
- Pub Date:
- April 2024
- DOI:
- 10.48550/arXiv.2404.14831
- arXiv:
- arXiv:2404.14831
- Bibcode:
- 2024arXiv240414831W
- Keywords:
-
- Computer Science - Databases;
- Computer Science - Computation and Language;
- Computer Science - Information Retrieval
- E-Print:
- Code and data are available at this https://github.com/tshu-w/Uniblocker