Comparative Analysis of Document-Level Embedding Methods for Similarity Scoring on Shakespeare Sonnets and Taylor Swift Lyrics
Abstract
This study evaluates the performance of TF-IDF weighting, averaged Word2Vec embeddings, and BERT embeddings for document similarity scoring across two contrasting textual domains. By analysing cosine similarity scores, the methods' strengths and limitations are highlighted. The findings underscore TF-IDF's reliance on lexical overlap and Word2Vec's superior semantic generalisation, particularly in cross-domain comparisons. BERT demonstrates lower performance in challenging domains, likely due to insufficient domainspecific fine-tuning.
- Publication:
-
arXiv e-prints
- Pub Date:
- December 2024
- DOI:
- arXiv:
- arXiv:2412.17552
- Bibcode:
- 2024arXiv241217552K
- Keywords:
-
- Computer Science - Computation and Language;
- Computer Science - Information Retrieval
- E-Print:
- 9 pages, 4 figures