Strategy Masking: A Method for Guardrails in Value-based Reinforcement Learning Agents

doi:10.48550/arXiv.2501.05501

The use of reward functions to structure AI learning and decision making is core to the current reinforcement learning paradigm; however, without careful design of reward functions, agents can learn to solve problems in ways that may be considered "undesirable" or "unethical." Without a thorough understanding of the incentives a reward function creates, it can be difficult to impose principled yet general control mechanisms over an agent's behavior. In this paper, we study methods for constructing guardrails for AI agents that use reward functions to learn decision making. We introduce a novel approach, which we call strategy masking, to explicitly learn and then suppress undesirable AI agent behavior. We apply our method to study lying in AI agents and show that strategy masking can effectively modify agent behavior by suppressing, or actively penalizing, the reward dimension for lying, such that agents act more honestly without compromising their ability to perform effectively.
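
The abstract does not give implementation details, so the following is only a minimal sketch of the general idea of suppressing or penalizing one reward dimension at decision time. It assumes the agent keeps a separate action-value estimate per reward dimension and combines them with one weight per dimension. The function names, the dimension labels ("task", "lying"), and all numbers are hypothetical and are not taken from the paper.

```python
# Illustrative sketch only: names and values are assumptions, not the paper's implementation.

def masked_values(q_by_dimension, mask):
    """Combine per-dimension action values using one mask weight per dimension.

    q_by_dimension: dict of dimension name -> list of values, one per action.
    mask: dict of dimension name -> weight
          (1.0 keeps the dimension, 0.0 suppresses it, a negative weight penalizes it).
          Dimensions missing from the mask default to a weight of 1.0.
    """
    n_actions = len(next(iter(q_by_dimension.values())))
    totals = [0.0] * n_actions
    for name, values in q_by_dimension.items():
        weight = mask.get(name, 1.0)
        for action, value in enumerate(values):
            totals[action] += weight * value
    return totals


def greedy_action(q_by_dimension, mask):
    """Return the index of the action with the highest masked value."""
    totals = masked_values(q_by_dimension, mask)
    return max(range(len(totals)), key=totals.__getitem__)


# Hypothetical two-action example: action 0 is honest, action 1 is a lie.
q = {"task": [1.0, 0.9], "lying": [0.0, 0.5]}

unmasked = greedy_action(q, {"task": 1.0, "lying": 1.0})    # lying dimension kept
suppressed = greedy_action(q, {"task": 1.0, "lying": 0.0})  # lying dimension zeroed out
penalized = greedy_action(q, {"task": 1.0, "lying": -1.0})  # lying dimension penalized
```

In this toy setup the unmasked agent prefers the lying action, while both the suppressed and the penalized masks lead it to choose the honest action. Whether the paper applies the mask during training, at inference, or both is not stated in the abstract.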

Publication: arXiv e-prints

Pub Date: January 2025

DOI: 10.48550/arXiv.2501.05501

arXiv: arXiv:2501.05501

Bibcode: 2025arXiv250105501K

Keywords: Computer Science - Artificial Intelligence; Computer Science - Machine Learning; Computer Science - Multiagent Systems; I.2.0
