Enhancing RL Safety with Counterfactual LLM Reasoning
Abstract
Reinforcement learning (RL) policies may exhibit unsafe behavior and are hard to explain. We use counterfactual large language model reasoning to enhance RL policy safety after training. We show that our approach both improves the safety of the RL policy and helps to explain it.
- Publication: arXiv e-prints
- Pub Date: September 2024
- DOI:
- arXiv: arXiv:2409.10188
- Bibcode: 2024arXiv240910188G
- Keywords: Computer Science - Machine Learning