Can Large Language Models Identify Authorship?
Abstract
The ability to accurately identify authorship is crucial for verifying content authenticity and mitigating misinformation. Large Language Models (LLMs) have demonstrated an exceptional capacity for reasoning and problem-solving; however, their potential in authorship analysis remains under-explored. Traditional studies have depended on hand-crafted stylistic features, whereas state-of-the-art approaches leverage text embeddings from pre-trained language models. These methods, which typically require fine-tuning on labeled data, often suffer from performance degradation in cross-domain applications and provide limited explainability. This work addresses three research questions: (1) Can LLMs perform zero-shot, end-to-end authorship verification effectively? (2) Can LLMs accurately attribute authorship among multiple candidate authors (e.g., 10 or 20)? (3) Can LLMs provide explainability in authorship analysis, particularly through the role of linguistic features? Moreover, we investigate the integration of explicit linguistic features to guide LLMs in their reasoning processes. Our assessment demonstrates LLMs' proficiency in both tasks without the need for domain-specific fine-tuning, and offers insight into their decision-making via a detailed analysis of linguistic features. This establishes a new benchmark for future research on LLM-based authorship analysis.
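The zero-shot, linguistically guided prompting described in the abstract can be sketched as below. This is a minimal illustration assuming access to the OpenAI Python client; the model name, prompt wording, and the specific feature list are illustrative assumptions, not the authors' exact setup.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical linguistic features used to guide the model's reasoning;
# the paper's actual feature inventory may differ.
LINGUISTIC_FEATURES = [
    "punctuation style",
    "sentence length and complexity",
    "function-word usage",
    "characteristic vocabulary",
    "tone and formality",
]

def verify_authorship(text_a: str, text_b: str, model: str = "gpt-4") -> str:
    """Zero-shot authorship verification: ask the LLM whether two texts
    share an author, grounding its answer in explicit linguistic features."""
    prompt = (
        "Verify whether the following two texts were written by the same author.\n"
        f"Analyze these linguistic features: {', '.join(LINGUISTIC_FEATURES)}.\n"
        "Explain your reasoning, then answer 'same author' or 'different authors'.\n\n"
        f"Text 1:\n{text_a}\n\nText 2:\n{text_b}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output for evaluation
    )
    return response.choices[0].message.content
```

Attribution among multiple candidates could follow the same pattern: list one writing sample per candidate author and ask the model to name the most likely match, again with its reasoning tied to the stated features.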
- Publication:
- arXiv e-prints
- Pub Date:
- March 2024
- DOI:
- 10.48550/arXiv.2403.08213
- arXiv:
- arXiv:2403.08213
- Bibcode:
- 2024arXiv240308213H
- Keywords:
- Computer Science - Computation and Language
- E-Print:
Accepted to EMNLP 2024 Findings. The main paper is 9 pages (16 pages in total). The code, results, dataset, and additional resources are available on the project website: https://llm-authorship.github.io/