LLM as a Scorer: The Impact of Output Order on Dialogue Evaluation

This research investigates the effect of prompt design on dialogue evaluation using large language models (LLMs). While LLMs are increasingly used for scoring various inputs, creating effective prompts for dialogue evaluation remains challenging due to model sensitivity and subjectivity in dialogue assessments. Our study experimented with different prompt structures, altering the sequence of output instructions and including explanatory reasons. We found that the order of presenting reasons and scores significantly influences LLMs' scoring, with a "reason-first" approach yielding more comprehensive evaluations. This insight is crucial for enhancing the accuracy and consistency of LLM-based evaluations.
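
To make the two output orders concrete, here is a minimal sketch of how a "reason-first" and a "score-first" prompt might be built. The function name `build_prompt`, the instruction wording, and the 1 to 5 scale are illustrative assumptions and are not taken from the paper.

```python
def build_prompt(dialogue: str, order: str = "reason_first") -> str:
    """Build a dialogue-evaluation prompt with a chosen output order.

    Hypothetical helper: the wording and the 1-5 scale are assumptions
    made for illustration, not the prompts used in the study.
    """
    instructions = {
        "reason": "Reason: explain your assessment in one or two sentences.",
        "score": "Score: give an integer from 1 to 5.",
    }
    if order == "reason_first":
        keys = ["reason", "score"]
    elif order == "score_first":
        keys = ["score", "reason"]
    else:
        raise ValueError(f"unknown order: {order!r}")

    # Number the output instructions in the requested sequence.
    output_spec = "\n".join(
        f"{i}. {instructions[k]}" for i, k in enumerate(keys, start=1)
    )
    return (
        "Evaluate the following dialogue.\n\n"
        f"Dialogue:\n{dialogue}\n\n"
        "Respond in this order:\n"
        f"{output_spec}\n"
    )


if __name__ == "__main__":
    sample = "A: Can you recommend a book?\nB: Sure, what genres do you like?"
    print(build_prompt(sample, "reason_first"))
    print(build_prompt(sample, "score_first"))
```

Only the order of the output instructions differs between the two variants, so any difference in the resulting scores can be attributed to that ordering.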

Publication:

arXiv e-prints

Pub Date:

June 2024

DOI:

10.48550/arXiv.2406.02863

arXiv:

arXiv:2406.02863

Bibcode:

2024arXiv240602863C

Keywords:

Computer Science - Computation and Language

E-Print:

Presented at the AAAI 2024 Spring Symposium. The first two authors contributed equally.
