Building an Evaluation Scale using Item Response Theory

doi:10.48550/arXiv.1605.08889

Evaluation of NLP methods requires testing against a previously vetted gold-standard test set and reporting standard metrics (accuracy, precision, recall, F1). The current assumption is that all items in a given test set are equal with regard to difficulty and discriminating power. We propose Item Response Theory (IRT) from psychometrics as an alternative means for gold-standard test-set generation and NLP system evaluation. IRT is able to describe characteristics of individual items (their difficulty and discriminating power) and can account for these characteristics in its estimation of human intelligence or ability for an NLP task. In this paper, we demonstrate IRT by generating a gold-standard test set for Recognizing Textual Entailment. By collecting a large number of human responses and fitting our IRT model, we show that our IRT model compares NLP systems with the performance of a human population and is able to provide more insight into system performance than standard evaluation metrics. We show that a high accuracy score does not always imply a high IRT score, which depends on the item characteristics and the response pattern.
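The closing claim of the abstract, that equal accuracy need not mean equal IRT score, can be illustrated with a minimal sketch. It assumes the two-parameter logistic (2PL) model, one standard IRT form; the abstract does not say which model the paper fits. The item parameters, the two response patterns, and the helper names (`p_correct`, `log_likelihood`, `estimate_ability`) are hypothetical and are not taken from the paper.

```python
import math

def p_correct(theta, a, b):
    """2PL probability that a respondent with ability theta answers correctly.

    a is the item's discrimination, b is its difficulty.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(theta, items, responses):
    """Log-likelihood of a 0/1 response pattern at ability theta."""
    total = 0.0
    for (a, b), r in zip(items, responses):
        p = p_correct(theta, a, b)
        total += math.log(p) if r else math.log(1.0 - p)
    return total

def estimate_ability(items, responses, lo=-4.0, hi=4.0, steps=801):
    """Maximum-likelihood ability estimate via a simple grid search."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda t: log_likelihood(t, items, responses))

# Hypothetical items as (discrimination a, difficulty b); assumed values.
ITEMS = [(0.5, -1.0), (0.5, 0.0), (2.0, 0.0), (2.0, 1.0)]

# Two hypothetical systems, each correct on exactly half of the items.
system_a = [1, 1, 0, 0]  # correct only on the weakly discriminating items
system_b = [0, 0, 1, 1]  # correct only on the strongly discriminating items

accuracy_a = sum(system_a) / len(system_a)
accuracy_b = sum(system_b) / len(system_b)
theta_a = estimate_ability(ITEMS, system_a)
theta_b = estimate_ability(ITEMS, system_b)

print(f"accuracy: A={accuracy_a:.2f}, B={accuracy_b:.2f}")
print(f"ability:  A={theta_a:.2f}, B={theta_b:.2f}")
```

Both hypothetical systems have the same accuracy, but under these assumed parameters system B receives the higher ability estimate because its correct answers fall on the more discriminating items. The grid search is chosen for transparency; a real analysis would fit item parameters from human response data first.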

Publication: arXiv e-prints

Pub Date: May 2016

DOI: 10.48550/arXiv.1605.08889

arXiv: arXiv:1605.08889

Bibcode: 2016arXiv160508889L

Keywords: Computer Science - Computation and Language

E-Print: To appear in the proceedings of EMNLP 2016
