How to Compute the Probability of a Word

doi:10.48550/arXiv.2406.14561


Language models (LMs) estimate a probability distribution over strings in a natural language; these distributions are crucial for computing perplexity and surprisal in linguistics research. While we are usually concerned with measuring these values for words, most LMs operate over subwords. Although this seems straightforward, accurately computing probabilities over one unit given probabilities over the other requires care. Indeed, we show here that many recent linguistic studies have been computing these values incorrectly. This paper derives the correct methods for computing word probabilities, highlighting issues that arise when relying on language models that use beginning-of-word (bow)-marking tokenisers, e.g., the GPT family. Empirically, we show that correcting this widespread bug in probability computations affects measured outcomes in sentence comprehension and lexical optimisation analyses.
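To make the issue concrete, the following is a minimal sketch built on a made-up next-token table rather than a real LM. It contrasts the naive word probability (the product of the word's subword probabilities) with a version that also accounts for where the word ends under a bow-marking tokeniser. The toy model `toy_lm`, the `_` marker, the `<eos>` handling, and the exact form of the rescaling are assumptions made for illustration; they are not quoted from the abstract, and the paper and its repository remain the reference for the actual derivation.

```python
import math

BOW_PREFIX = "_"  # hypothetical marker for beginning-of-word tokens
EOS = "<eos>"     # assumption: end-of-string also closes the current word


def toy_lm(context):
    """Hypothetical next-token distribution; stands in for a real LM."""
    if context and context[-1] == "_a":
        # After "_a", the continuation token "c" is very likely.
        return {"_a": 0.1, "_b": 0.1, "c": 0.7, EOS: 0.1}
    return {"_a": 0.5, "_b": 0.3, "c": 0.1, EOS: 0.1}


def bow_mass(context):
    """Probability that the next token starts a new word or ends the string."""
    return sum(p for tok, p in toy_lm(context).items()
               if tok.startswith(BOW_PREFIX) or tok == EOS)


def naive_word_logprob(context, subwords):
    """Chain rule over the word's subwords only."""
    ctx = tuple(context)
    logp = 0.0
    for tok in subwords:
        logp += math.log(toy_lm(ctx)[tok])
        ctx += (tok,)
    return logp


def corrected_word_logprob(context, subwords):
    """Naive value rescaled by the bow mass after and before the word.

    Multiplying by bow_mass(after) requires that the word really ends here;
    dividing by bow_mass(context) conditions on the previous word having ended.
    """
    ctx = tuple(context)
    after = ctx + tuple(subwords)
    return (naive_word_logprob(ctx, subwords)
            + math.log(bow_mass(after))
            - math.log(bow_mass(ctx)))


context = ("_b",)
naive = math.exp(naive_word_logprob(context, ["_a"]))
corrected = math.exp(corrected_word_logprob(context, ["_a"]))
print(f"naive={naive:.4f} corrected={corrected:.4f}")
```

In this toy setting the naive value for the word "a" also counts strings in which `_a` is merely the start of a longer word such as "ac", which is why the two numbers differ.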

Publication:

arXiv e-prints

Pub Date:

June 2024

DOI:

10.48550/arXiv.2406.14561

arXiv:

arXiv:2406.14561

Bibcode:

2024arXiv240614561P

Keywords:

Computer Science - Computation and Language

E-Print:

Camera-ready version for EMNLP 2024. Our code is available at https://github.com/tpimentelms/probability-of-a-word

NASA/ADS
