Unsupervised Story Discovery from Continuous News Streams via Scalable Thematic Embedding
Abstract
Unsupervised discovery of stories with correlated news articles in real-time helps people digest massive news streams without expensive human annotations. A common approach of the existing studies for unsupervised online story discovery is to represent news articles with symbolic- or graph-based embedding and incrementally cluster them into stories. Recent large language models are expected to improve the embedding further, but a straightforward adoption of the models by indiscriminately encoding all information in articles is ineffective to deal with text-rich and evolving news streams. In this work, we propose a novel thematic embedding with an off-the-shelf pretrained sentence encoder to dynamically represent articles and stories by considering their shared temporal themes. To realize the idea for unsupervised online story discovery, a scalable framework USTORY is introduced with two main techniques, theme- and time-aware dynamic embedding and novelty-aware adaptive clustering, fueled by lightweight story summaries. A thorough evaluation with real news data sets demonstrates that USTORY achieves higher story discovery performances than baselines while being robust and scalable to various streaming settings.
- Publication:
-
arXiv e-prints
- Pub Date:
- April 2023
- DOI:
- 10.48550/arXiv.2304.04099
- arXiv:
- arXiv:2304.04099
- Bibcode:
- 2023arXiv230404099Y
- Keywords:
-
- Computer Science - Information Retrieval;
- Computer Science - Computation and Language;
- Computer Science - Databases;
- Computer Science - Machine Learning
- E-Print:
- Accepted by SIGIR'23