Information-Theoretic Generative Clustering of Documents

doi:10.48550/arXiv.2412.13534

We present {\em generative clustering} (GC) for clustering a set of documents, $\mathrm{X}$, by using texts $\mathrm{Y}$ generated by large language models (LLMs) rather than clustering the original documents $\mathrm{X}$ directly. Because LLMs provide probability distributions, the similarity between two documents can be rigorously defined in an information-theoretic manner by the KL divergence. We also propose a natural, novel clustering algorithm based on importance sampling. We show that GC achieves state-of-the-art performance, outperforming any previous clustering method, often by a large margin. Furthermore, we show an application to generative document retrieval, in which documents are indexed via hierarchical clustering, and our method improves retrieval accuracy.
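
The abstract does not spell out how the KL divergence is computed or how importance sampling enters. As an illustration only, and not the authors' algorithm, the following minimal sketch shows the two generic building blocks: the exact KL divergence between two discrete distributions, and a self-normalized importance-sampling estimate of it from log-probabilities evaluated at a shared set of sampled texts. The function names, the proposal distribution `q`, and the assumption that log-probabilities of each sampled text are available for each document are all hypothetical.

```python
import math


def kl_divergence(p, q):
    """Exact KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def kl_importance_sampled(logp_a, logp_b, logq):
    """Self-normalized importance-sampling estimate of KL(p_a || p_b).

    All three arguments are lists of log-probabilities evaluated at the same
    texts y_1..y_J, which are assumed (hypothetically) to have been drawn
    from a proposal distribution q.
    """
    # Log importance weights: log(p_a(y_j) / q(y_j)).
    log_w = [la - lq for la, lq in zip(logp_a, logq)]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(log_w)
    w = [math.exp(v - m) for v in log_w]
    total = sum(w)
    # Weighted average of the log-ratio log(p_a(y_j) / p_b(y_j)).
    return sum(
        (wj / total) * (la - lb) for wj, la, lb in zip(w, logp_a, logp_b)
    )
```

When the sampled texts happen to enumerate a small support exactly once and `q` is uniform, the estimate coincides with the exact value, which gives a simple sanity check. In a setting like the one the abstract describes, the log-probabilities would come from an LLM scoring each generated text given each document.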

Publication:

arXiv e-prints

Pub Date:

December 2024

DOI:

10.48550/arXiv.2412.13534

arXiv:

arXiv:2412.13534

Bibcode:

2024arXiv241213534D

Keywords:

Computer Science - Machine Learning;
Computer Science - Computation and Language;
Computer Science - Information Retrieval;
Computer Science - Information Theory

E-Print:

Accepted to AAAI 2025
