Shades of BLEU, Flavours of Success: The Case of MultiWOZ
Abstract
The MultiWOZ dataset (Budzianowski et al.,2018) is frequently used for benchmarking context-to-response abilities of task-oriented dialogue systems. In this work, we identify inconsistencies in data preprocessing and reporting of three corpus-based metrics used on this dataset, i.e., BLEU score and Inform & Success rates. We point out a few problems of the MultiWOZ benchmark such as unsatisfactory preprocessing, insufficient or under-specified evaluation metrics, or rigid database. We re-evaluate 7 end-to-end and 6 policy optimization models in as-fair-as-possible setups, and we show that their reported scores cannot be directly compared. To facilitate comparison of future systems, we release our stand-alone standardized evaluation scripts. We also give basic recommendations for corpus-based benchmarking in future works.
- Publication:
-
arXiv e-prints
- Pub Date:
- June 2021
- DOI:
- 10.48550/arXiv.2106.05555
- arXiv:
- arXiv:2106.05555
- Bibcode:
- 2021arXiv210605555N
- Keywords:
-
- Computer Science - Computation and Language
- E-Print:
- Accepted to GEM Workshop at ACL 2021