Factored Verification: Detecting and Reducing Hallucination in Summaries of Academic Papers

doi:10.48550/arXiv.2310.10627

Hallucination plagues even frontier LLMs, but how bad is it really for summarizing academic papers? We evaluate Factored Verification, a simple automated method for detecting hallucinations in abstractive summaries. This method sets a new state of the art on hallucination detection in the summarization task of the HaluEval benchmark, achieving 76.2% accuracy. We then use this method to estimate how often language models hallucinate when summarizing across multiple academic papers and find 0.62 hallucinations in the average ChatGPT (16k) summary, 0.84 for GPT-4, and 1.55 for Claude 2. We ask models to self-correct using Factored Critiques and find that this lowers the number of hallucinations to 0.49 for ChatGPT, 0.46 for GPT-4, and 0.95 for Claude 2. The hallucinations we find are often subtle, so we advise caution when using models to synthesize academic papers.

Publication: arXiv e-prints

Pub Date: October 2023

DOI: 10.48550/arXiv.2310.10627

arXiv: arXiv:2310.10627

Bibcode: 2023arXiv231010627G

Keywords: Computer Science - Computation and Language; Computer Science - Artificial Intelligence

E-Print: Second Workshop on Information Extraction from Scientific Publications (WIESP) at IJCNLP-AACL 2023
