Fast Slate Policy Optimization: Going Beyond Plackett-Luce
Abstract
An increasingly important building block of large-scale machine learning systems is the ability to return slates: ordered lists of items given a query. Applications of this technology include search, information retrieval, and recommender systems. When the action space is large, decision systems are restricted to a particular structure so that online queries can be completed quickly. This paper addresses the optimization of these large-scale decision systems given an arbitrary reward function. We cast this learning problem in a policy optimization framework and propose a new class of policies, born from a novel relaxation of decision functions. This results in a simple yet efficient learning algorithm that scales to massive action spaces. We compare our method to the commonly adopted Plackett-Luce policy class and demonstrate the effectiveness of our approach on problems with action space sizes on the order of millions.
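For readers unfamiliar with the Plackett-Luce baseline named in the abstract, the following is a minimal sketch of that policy class only, not of the method the paper proposes. It assumes each item has a real-valued score, samples an ordered slate without replacement using the Gumbel top-k trick, and computes the slate's log-probability. The function names and the example scores are hypothetical and do not come from the paper.

```python
import heapq
import math
import random


def sample_plackett_luce_slate(scores, slate_size, rng):
    """Sample an ordered slate (list of item indices) from a Plackett-Luce
    distribution whose logits are `scores`, using the Gumbel top-k trick."""
    perturbed = []
    for s in scores:
        # Clamp the uniform draw away from 0 and 1 so both logs are defined.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        perturbed.append(s - math.log(-math.log(u)))  # score + Gumbel(0, 1) noise
    # The k largest perturbed scores, in descending order, form the slate.
    return heapq.nlargest(slate_size, range(len(scores)), key=perturbed.__getitem__)


def plackett_luce_log_prob(scores, slate):
    """Log-probability of an ordered slate under Plackett-Luce: a product of
    softmax terms, each taken over the items not yet placed in the slate."""
    remaining = set(range(len(scores)))
    log_prob = 0.0
    for item in slate:
        m = max(scores[i] for i in remaining)  # subtract the max for stability
        log_norm = m + math.log(sum(math.exp(scores[i] - m) for i in remaining))
        log_prob += scores[item] - log_norm
        remaining.remove(item)
    return log_prob


if __name__ == "__main__":
    rng = random.Random(0)
    example_scores = [0.5, -1.0, 2.0, 0.0, 1.5, -0.3]  # hypothetical item scores
    slate = sample_plackett_luce_slate(example_scores, slate_size=3, rng=rng)
    print(slate, plackett_luce_log_prob(example_scores, slate))
```

As written, the log-probability loop renormalizes over all remaining items at every slate position, so its cost grows with both the slate size and the number of items. This is only an illustration of why the choice of policy class matters when the action space is large.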
- Publication: arXiv e-prints
- Pub Date: August 2023
- DOI: 10.48550/arXiv.2308.01566
- arXiv: arXiv:2308.01566
- Bibcode: 2023arXiv230801566S
- Keywords: Computer Science - Machine Learning; Computer Science - Information Retrieval; Statistics - Machine Learning
- E-Print: Transactions on Machine Learning Research