Information-Theoretic Generative Clustering of Documents
Abstract
We present generative clustering (GC), which clusters a set of documents, $\mathrm{X}$, using texts $\mathrm{Y}$ generated by large language models (LLMs) rather than the original documents $\mathrm{X}$ themselves. Because LLMs provide probability distributions, the similarity between two documents can be rigorously defined in an information-theoretic manner by the KL divergence. We also propose a natural, novel clustering algorithm based on importance sampling. We show that GC achieves state-of-the-art performance, outperforming any previous clustering method, often by a large margin. Furthermore, we show an application to generative document retrieval, in which documents are indexed via hierarchical clustering, and our method improves the retrieval accuracy.
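As a rough illustration of the idea of comparing two documents through the KL divergence of their generated-text distributions, the sketch below estimates $\mathrm{KL}(p_a \,\|\, p_b)$ from a finite set of sampled texts using self-normalized importance sampling. It is a minimal sketch under stated assumptions, not the algorithm from the paper: the function name `kl_importance_estimate`, the proposal distribution `q`, and the toy three-outcome distributions are all hypothetical, and in practice the log-probabilities would come from an LLM scoring each sampled text given each document.

```python
import math

def kl_importance_estimate(logp_a, logp_b, logq):
    """Self-normalized importance-sampling estimate of KL(p_a || p_b).

    logp_a[j], logp_b[j]: log-probability of sample j under each distribution
    logq[j]: log-probability of sample j under the proposal it was drawn from
    (All three inputs are assumptions of this sketch.)
    """
    # Unnormalized log importance weights: log(p_a(y_j) / q(y_j)).
    log_w = [a - q for a, q in zip(logp_a, logq)]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(log_w)
    w = [math.exp(v - m) for v in log_w]
    total = sum(w)
    # Weighted average of the log-ratio log(p_a(y_j) / p_b(y_j)).
    return sum(wi / total * (a - b) for wi, a, b in zip(w, logp_a, logp_b))

# Toy check on a three-outcome space, with each outcome sampled exactly once
# from a uniform proposal, so the estimate coincides with the exact KL.
p_a = [0.5, 0.3, 0.2]
p_b = [0.2, 0.3, 0.5]
q = [1 / 3, 1 / 3, 1 / 3]

estimate = kl_importance_estimate(
    [math.log(v) for v in p_a],
    [math.log(v) for v in p_b],
    [math.log(v) for v in q],
)
exact = sum(a * math.log(a / b) for a, b in zip(p_a, p_b))
self_kl = kl_importance_estimate(
    [math.log(v) for v in p_a],
    [math.log(v) for v in p_a],
    [math.log(v) for v in q],
)
```

With real generated texts the sample would be far from exhaustive, so the estimate would only approximate the true divergence, and its quality would depend on how well the proposal covers the texts that are likely under $p_a$.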
- Publication: arXiv e-prints
- Pub Date: December 2024
- arXiv: arXiv:2412.13534
- Bibcode: 2024arXiv241213534D
- Keywords: Computer Science - Machine Learning; Computer Science - Computation and Language; Computer Science - Information Retrieval; Computer Science - Information Theory
- E-Print: Accepted to AAAI 2025