Peacock: Learning Long-Tail Topic Features for Industrial Applications
Abstract
Latent Dirichlet allocation (LDA) is a popular topic modeling technique in academia but less so in industry, especially in large-scale applications involving search engine and online advertising systems. A main underlying reason is that the topic models used have been too small in scale to be useful; for example, some of the largest LDA models reported in literature have up to $10^3$ topics, which cover difficultly the long-tail semantic word sets. In this paper, we show that the number of topics is a key factor that can significantly boost the utility of topic-modeling systems. In particular, we show that a "big" LDA model with at least $10^5$ topics inferred from $10^9$ search queries can achieve a significant improvement on industrial search engine and online advertising systems, both of which serving hundreds of millions of users. We develop a novel distributed system called Peacock to learn big LDA models from big data. The main features of Peacock include hierarchical distributed architecture, real-time prediction and topic de-duplication. We empirically demonstrate that the Peacock system is capable of providing significant benefits via highly scalable LDA topic models for several industrial applications.
- Publication:
-
arXiv e-prints
- Pub Date:
- May 2014
- DOI:
- 10.48550/arXiv.1405.4402
- arXiv:
- arXiv:1405.4402
- Bibcode:
- 2014arXiv1405.4402W
- Keywords:
-
- Computer Science - Information Retrieval;
- Computer Science - Distributed;
- Parallel;
- and Cluster Computing
- E-Print:
- 23 pages, 11 figures, ACM Transactions on Intelligent Systems and Technology, 2015