Factored Verification: Detecting and Reducing Hallucination in Summaries of Academic Papers
Abstract
Hallucination plagues even frontier LLMs--but how bad is it really for summarizing academic papers? We evaluate Factored Verification, a simple automated method for detecting hallucinations in abstractive summaries. This method sets a new SotA on hallucination detection in the summarization task of the HaluEval benchmark, achieving 76.2% accuracy. We then use this method to estimate how often language models hallucinate when summarizing across multiple academic papers and find 0.62 hallucinations in the average ChatGPT (16k) summary, 0.84 for GPT-4, and 1.55 for Claude 2. We ask models to self-correct using Factored Critiques and find that this lowers the number of hallucinations to 0.49 for ChatGPT, 0.46 for GPT-4, and 0.95 for Claude 2. The hallucinations we find are often subtle, so we advise caution when using models to synthesize academic papers.
- Publication:
-
arXiv e-prints
- Pub Date:
- October 2023
- DOI:
- 10.48550/arXiv.2310.10627
- arXiv:
- arXiv:2310.10627
- Bibcode:
- 2023arXiv231010627G
- Keywords:
-
- Computer Science - Computation and Language;
- Computer Science - Artificial Intelligence
- E-Print:
- Second Workshop on Information Extraction from Scientific Publications (WIESP) at IJCNLP-AACL 2023