Learning Collaborative Policies to Solve NP-hard Routing Problems
Abstract
Recently, deep reinforcement learning (DRL) frameworks have shown potential for solving NP-hard routing problems such as the traveling salesman problem (TSP) without problem-specific expert knowledge. Although DRL can be used to solve complex problems, DRL frameworks still struggle to compete with state-of-the-art heuristics, showing a substantial performance gap. This paper proposes a novel hierarchical problem-solving strategy, termed learning collaborative policies (LCP), which can effectively find the near-optimum solution using two iterative DRL policies: the seeder and the reviser. The seeder generates candidate solutions (seeds) that are as diverse as possible while exploring the full combinatorial action space (i.e., the sequence of assignment actions). To this end, we train the seeder's policy with a simple yet effective entropy regularization reward that encourages it to find diverse solutions. The reviser, in turn, modifies each candidate solution generated by the seeder: it partitions the full trajectory into subtours and simultaneously revises each subtour to minimize its traveling distance. The reviser is thus trained to improve candidate-solution quality while focusing on a reduced solution space, which is beneficial for exploitation. Extensive experiments demonstrate that the proposed two-policy collaboration scheme improves over single-policy DRL frameworks on various NP-hard routing problems, including the TSP, the prize collecting TSP (PCTSP), and the capacitated vehicle routing problem (CVRP).
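The two mechanisms described above can be illustrated with a minimal sketch: an entropy-regularized reward for the seeder and a contiguous subtour partition for the reviser. The function names, the weight `alpha`, and the fixed subtour length `k` are illustrative assumptions, not the paper's exact formulation.

```python
import math

def step_entropy(probs):
    # Shannon entropy of one step's categorical distribution over next cities.
    return -sum(p * math.log(p) for p in probs if p > 0)

def seeder_reward(tour_length, step_probs, alpha=0.1):
    # Entropy-regularized reward (sketch): shorter tours score higher, and an
    # entropy bonus rewards policies whose per-step choices stay diverse.
    # `alpha` is a hypothetical weight; the paper's exact form may differ.
    entropy_bonus = sum(step_entropy(p) for p in step_probs)
    return -tour_length + alpha * entropy_bonus

def partition(tour, k):
    # Split a full tour into contiguous subtours of length k, which the
    # reviser could then shorten independently (reduced solution space).
    return [tour[i:i + k] for i in range(0, len(tour), k)]

# A near-uniform policy earns a larger bonus than a deterministic one for the
# same tour length, nudging the seeder toward diverse candidate tours.
diverse = seeder_reward(10.0, [[0.5, 0.5], [0.5, 0.5]])
greedy = seeder_reward(10.0, [[1.0, 0.0], [1.0, 0.0]])
```

Here `diverse > greedy`, so among equally long tours the seeder is pushed toward higher-entropy (more exploratory) behavior, while `partition` carves out the small subproblems the reviser exploits.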
 Publication:

arXiv e-prints
 Pub Date:
 October 2021
 arXiv:
 arXiv:2110.13987
 Bibcode:
 2021arXiv211013987K
 Keywords:

 Computer Science - Machine Learning;
 Statistics - Machine Learning
 E-Print:
 NeurIPS 2021, 23 pages, 8 figures