Cautious Reinforcement Learning with Logical Constraints
Abstract
This paper presents the concept of an adaptive safe padding that forces Reinforcement Learning (RL) to synthesise optimal control policies while ensuring safety during the learning process. Policies are synthesised to satisfy a goal, expressed as a temporal logic formula, with maximal probability. Forcing the RL agent to stay safe during learning might limit exploration; however, we show that the proposed architecture automatically handles the trade-off between efficient progress in exploration (towards goal satisfaction) and ensuring safety. Theoretical guarantees are available on the optimality of the synthesised policies and on the convergence of the learning algorithm. Experimental results are provided to showcase the performance of the proposed method.
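The idea of restricting exploration to actions that look safe can be illustrated with a small sketch. This is not the paper's algorithm: the abstract does not specify how the safe padding is computed, so the code below assumes a simple stand-in in which the agent keeps empirical transition counts and blocks any action whose estimated one-step probability of reaching an unsafe state exceeds a threshold. The names `estimated_risk`, `safe_actions`, `max_risk`, and the example states are all hypothetical.

```python
def estimated_risk(counts, state, action, unsafe):
    """Empirical one-step probability of landing in an unsafe state.

    `counts` maps (state, action) to a dict of {next_state: visit_count}.
    """
    outcomes = counts.get((state, action), {})
    total = sum(outcomes.values())
    if total == 0:
        # Assumption: actions never tried are treated pessimistically.
        return 1.0
    return sum(n for s, n in outcomes.items() if s in unsafe) / total


def safe_actions(counts, state, actions, unsafe, max_risk):
    """Return the actions whose estimated risk is at most `max_risk`."""
    risks = {a: estimated_risk(counts, state, a, unsafe) for a in actions}
    allowed = [a for a in actions if risks[a] <= max_risk]
    # If every action is blocked, fall back to the least risky one
    # so that the agent is never left without a move.
    return allowed or [min(actions, key=risks.get)]


# Hypothetical experience gathered in state "s0".
counts = {
    ("s0", "left"): {"s1": 9, "trap": 1},    # estimated risk 0.1
    ("s0", "right"): {"s2": 4, "trap": 6},   # estimated risk 0.6
}
actions = ["left", "right"]
unsafe = {"trap"}

cautious = safe_actions(counts, "s0", actions, unsafe, max_risk=0.2)
permissive = safe_actions(counts, "s0", actions, unsafe, max_risk=0.7)
```

In this sketch, a lower `max_risk` gives a more conservative agent that explores less, while a higher value permits more exploration at greater risk. That is the trade-off the abstract says the proposed architecture handles automatically; here the threshold is fixed by hand.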
- Publication:
- arXiv e-prints
- Pub Date:
- February 2020
- DOI:
- 10.48550/arXiv.2002.12156
- arXiv:
- arXiv:2002.12156
- Bibcode:
- 2020arXiv200212156H
- Keywords:
- Computer Science - Machine Learning;
- Computer Science - Artificial Intelligence;
- Computer Science - Logic in Computer Science;
- Electrical Engineering and Systems Science - Systems and Control;
- Statistics - Machine Learning
- E-Print:
- Accepted to AAMAS 2020. arXiv admin note: text overlap with arXiv:1902.00778