We describe the bailout of banks by governments as a Markov Decision Process (MDP) in which the actions are equity investments. The underlying dynamics are derived from the network of financial institutions linked by mutual exposures, and the negative rewards are associated with the banks' defaults. Each node represents a bank and is associated with a probability of default per unit time (PD) that depends on its capital and is increased by the default of neighbouring nodes. Governments can control the systemic risk of the network by providing additional capital to the banks, lowering their PD at the expense of an increased exposure in case of their failure. Considering the network of European global systemically important institutions, we find the optimal investment policy that solves the MDP, providing direct indications to governments and regulators on the best course of action to limit the effects of financial crises.
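
To make the structure of such an MDP concrete, the following toy sketch may help. It is not the model solved in this work: the two-bank network, the three-valued bank status, the multiplicative contagion rule, every numerical parameter, and the use of value iteration as the solver are assumptions chosen only for illustration. It keeps the three ingredients named above: a PD that falls with added capital, a PD that rises when a neighbour has defaulted, and a larger loss when a bank holding government equity fails.

```python
import itertools

# All values below are illustrative assumptions, not calibrated to any data.
N_BANKS = 2
DEFAULTED, BASE, BAILED = 0, 1, 2   # per-bank status
PD = {BASE: 0.10, BAILED: 0.02}     # default probability per time step
CONTAGION = 2.0                     # PD multiplier per defaulted neighbour
LOSS = 1.0                          # loss incurred when a bank defaults
STAKE = 0.3                         # extra loss if the government holds equity
GAMMA = 0.95                        # discount factor

STATES = list(itertools.product((DEFAULTED, BASE, BAILED), repeat=N_BANKS))


def actions(state):
    """None = no investment; i = invest in bank i (alive, not yet bailed out)."""
    return [None] + [i for i, s in enumerate(state) if s == BASE]


def apply_action(state, action):
    """An investment immediately moves the chosen bank to the BAILED status."""
    if action is None:
        return state
    new = list(state)
    new[action] = BAILED
    return tuple(new)


def default_prob(state, i):
    """PD of bank i, raised by each neighbour that has already defaulted."""
    n_def = sum(1 for j, s in enumerate(state) if j != i and s == DEFAULTED)
    return min(1.0, PD[state[i]] * CONTAGION ** n_def)


def transitions(state):
    """Yield (probability, next_state, reward) for one time step."""
    alive = [i for i, s in enumerate(state) if s != DEFAULTED]
    for outcome in itertools.product((False, True), repeat=len(alive)):
        prob, reward, nxt = 1.0, 0.0, list(state)
        for i, defaults in zip(alive, outcome):
            p = default_prob(state, i)
            prob *= p if defaults else 1.0 - p
            if defaults:
                reward -= LOSS + (STAKE if state[i] == BAILED else 0.0)
                nxt[i] = DEFAULTED
        yield prob, tuple(nxt), reward


def q_value(state, action, V):
    """Expected discounted return of taking `action` in `state`."""
    post = apply_action(state, action)
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in transitions(post))


def value_iteration(tol=1e-10):
    """Iterate the Bellman optimality update until the values stop changing."""
    V = {s: 0.0 for s in STATES}
    while True:
        new_V = {s: max(q_value(s, a, V) for a in actions(s)) for s in STATES}
        converged = max(abs(new_V[s] - V[s]) for s in STATES) < tol
        V = new_V
        if converged:
            break
    policy = {s: max(actions(s), key=lambda a: q_value(s, a, V)) for s in STATES}
    return V, policy


V, policy = value_iteration()
for s in STATES:
    print(s, round(V[s], 4), policy[s])
```

In this sketch an investment has no upfront cost; its only price is the additional `STAKE` lost if the recapitalised bank later fails, mirroring the trade-off between a lower PD and a higher exposure. A realistic network would have far too many states for exhaustive enumeration, so the tabular solver used here should be read as a didactic device only.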