WebSets: Extracting Sets of Entities from the Web Using Unsupervised Information Extraction
Abstract
We describe an open-domain information extraction method for extracting concept-instance pairs from an HTML corpus. Most earlier approaches to this problem rely on combining clusters of distributionally similar terms and concept-instance pairs obtained with Hearst patterns. In contrast, our method relies on a novel approach for clustering terms found in HTML tables, and then assigning concept names to these clusters using Hearst patterns. The method can be efficiently applied to a large corpus, and experimental results on several datasets show that our method can accurately extract large numbers of concept-instance pairs.
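To make the second step of the abstract concrete, the following is a minimal illustrative sketch of naming a term cluster with a Hearst pattern. It is not the authors' implementation: it assumes the clusters have already been produced (for example from HTML table columns), it uses only the single "such as" pattern, and the names `hearst_pairs` and `label_cluster` are hypothetical.

```python
import re
from collections import Counter

# Separators allowed between listed instances; longer alternatives come first
# so that ", and " is not mistaken for ", " followed by the word "and".
SEP = r"(?:, and |, or |, | and | or )"

# One classic Hearst pattern: "<concept> such as <inst>, <inst> and <inst>".
SUCH_AS = re.compile(rf"(\w+) such as (\w+(?:{SEP}\w+)*)")


def hearst_pairs(text):
    """Yield (concept, instance) pairs matched by the 'such as' pattern."""
    for match in SUCH_AS.finditer(text.lower()):
        concept = match.group(1)
        for instance in re.split(SEP, match.group(2)):
            yield concept, instance


def label_cluster(cluster, pairs):
    """Pick the concept that co-occurs with the most terms in the cluster.

    Simple majority vote (an assumption made for this sketch); returns
    None when no term in the cluster appears in any pair.
    """
    votes = Counter(concept for concept, instance in pairs if instance in cluster)
    return votes.most_common(1)[0][0] if votes else None


text = (
    "Countries such as India, China and Japan export tea. "
    "Fruits such as apples and mangoes are sweet."
)
pairs = list(hearst_pairs(text))
country_label = label_cluster({"india", "japan", "brazil"}, pairs)
fruit_label = label_cluster({"apples", "pears"}, pairs)
```

A term such as "brazil" never appears in a Hearst match here, yet it inherits the label of its cluster. This is the benefit the abstract points to: the cluster supplies coverage and the pattern supplies the name.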
- Publication:
- arXiv e-prints
- Pub Date:
- June 2013
- DOI:
- 10.48550/arXiv.1307.0261
- arXiv:
- arXiv:1307.0261
- Bibcode:
- 2013arXiv1307.0261D
- Keywords:
- Computer Science - Machine Learning;
- Computer Science - Computation and Language;
- Computer Science - Information Retrieval
- E-Print:
- 10 pages