Improving DPG-style gradients on environments with discrete actions
Abstract
This project evaluates a new reinforcement learning algorithm on the DeepMind Control Suite.
The algorithm offers a new way to use the strengths of deterministic policy gradient (DPG) style gradients on environments with discrete actions. A common approach is to apply Gumbel noise to the action distribution. However, with this approach the critic tends to learn a "step function" over action distributions rather than the true expected return of each distribution, which can cause less accurate gradients to flow to the actor during training. In our proposed method, the critic is trained to represent the expected return of any action distribution it is given, so the gradients passed to the actor have a stronger theoretical basis for being accurate.
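The contrast between a "step function" critic target and the true expected return can be illustrated with a minimal sketch. It assumes a one-step (bandit) setting with made-up per-action returns; the function names `gumbel_softmax`, `step_target`, and `expected_return` are hypothetical and are not taken from the project's code. It shows one possible reading of the problem, not the project's actual training procedure.

```python
import math
import random


def gumbel_softmax(logits, temperature, rng):
    """Draw a relaxed one-hot sample by adding Gumbel noise to the logits."""
    noisy = []
    for logit in logits:
        # Clamp the uniform draw away from 0 and 1 so both logs are defined.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / temperature)
    # Numerically stable softmax over the noisy logits.
    m = max(noisy)
    exps = [math.exp(x - m) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def step_target(probs, action_returns):
    """Return of the arg-max action: piecewise constant in probs."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return action_returns[best]


def expected_return(probs, action_returns):
    """Expected return under probs (assumed one-step setting): linear in probs."""
    return sum(p * r for p, r in zip(probs, action_returns))


if __name__ == "__main__":
    action_returns = [0.0, 1.0]  # hypothetical returns for two discrete actions
    for probs in ([0.9, 0.1], [0.6, 0.4]):
        print(probs, step_target(probs, action_returns),
              expected_return(probs, action_returns))
    rng = random.Random(0)
    print(gumbel_softmax([0.0, 0.0], temperature=0.5, rng=rng))
```

In this toy setting, the two distributions [0.9, 0.1] and [0.6, 0.4] share the same arg-max action, so a target built from the arg-max action is identical for both and carries no slope with respect to the probabilities. The expected return differs between them, which is the kind of signal the actor needs from the critic's gradient.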