
---
title: 'Improved Q-learning with continuous actions'
summary: ''
difficulty: 2 # out of 3
---

<p>
<a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.466.7149&rep=rep1&type=pdf">Q-learning</a> is one of the oldest and most general reinforcement learning (RL) algorithms. It works by estimating the long-term expected return of each state-action pair. Essentially, the goal of the Q-learning algorithm is to fold the long-term outcome of each state-action pair into a single scalar that tells us how good that combination is; we can then maximize our reward by picking the action with the greatest value of the Q-function. Q-learning is the basis of the <a href="http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html">DQN</a> algorithm, which demonstrated that the combination of RL and deep learning is a fruitful one.
</p>

<p>
Your goal is to create a robust Q-learning implementation that can solve all <a href="https://gym.openai.com">Gym</a> environments with continuous action spaces without changing hyperparameters.
</p>

<p>
You may want to use the <a href="http://arxiv.org/pdf/1603.00748.pdf">Normalized Advantage Function (NAF)</a> model as a starting point. It is especially interesting to experiment with variants of the NAF model: for example, try it with a diagonal covariance. It can also be interesting to explore an advantage function that uses the maximum of several quadratics, which is a convenient functional form because its argmax is easy to compute (see the sketches after the notes below).
</p>

<hr>

<h3>Notes</h3>

<p>
This project is mainly concerned with reimplementing an existing algorithm. However, there is significant value in obtaining a very robust implementation, and there is a decent chance that new ideas will be required to get it working reliably across multiple tasks.
</p>
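<p>
For reference, the core Q-learning update fits in a few lines. Below is a minimal tabular sketch (the sizes and hyperparameters are illustrative, not part of the task): the difficulty in continuous action spaces is that the max over actions in the update is no longer a cheap argmax over a table row, which is exactly what NAF's quadratic parameterization is designed to remove.
</p>

<pre><code>import numpy as np

# Minimal tabular Q-learning sketch; sizes and hyperparameters are illustrative.
n_states, n_actions = 5, 3
Q = np.zeros((n_states, n_actions))   # Q[s, a]: estimated long-term return
alpha, gamma = 0.1, 0.99              # learning rate and discount factor

def q_update(s, a, r, s_next):
    """One Q-learning step: move Q[s, a] toward r + gamma * max over a' of Q[s_next, a']."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

def greedy_action(s):
    """Pick the action with the greatest Q-value in state s."""
    return int(np.argmax(Q[s]))
</code></pre>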
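<p>
The quadratic advantage at the heart of NAF, the diagonal-covariance variant, and the maximum-of-several-quadratics idea can all be sketched as follows. This is only an illustration of the functional forms (assuming NumPy, with made-up numbers), not a full implementation: in NAF, mu(s), L(s), and the value V(s) would be outputs of a neural network trained with the Q-learning target.
</p>

<pre><code>import numpy as np

def naf_advantage(a, mu, L):
    """NAF advantage A(s, a) = -0.5 (a - mu)^T P (a - mu) with P = L L^T,
    where L is lower triangular with a positive diagonal. P is positive
    semi-definite, so A is a concave quadratic whose argmax over a is mu."""
    P = L @ L.T
    d = a - mu
    return -0.5 * d @ P @ d

def naf_advantage_diagonal(a, mu, p_diag):
    """Diagonal-covariance variant: P = diag(p_diag), all entries positive."""
    d = a - mu
    return -0.5 * np.sum(p_diag * d * d)

def argmax_of_max_of_quadratics(mus, peaks):
    """Advantage max over k of [peaks[k] - 0.5 (a - mus[k])^T P_k (a - mus[k])].
    Each quadratic attains its maximum, peaks[k], at a = mus[k], so the
    argmax of the pointwise maximum is the mean of the highest-peaked one."""
    return mus[int(np.argmax(peaks))]

# Illustrative usage in a 2-D action space (all numbers are made up).
mu = np.array([0.3, -0.1])            # hypothetical network output mu(s)
L = np.array([[1.0, 0.0],
              [0.2, 0.5]])            # hypothetical lower-triangular L(s)
print(naf_advantage(mu, mu, L))       # 0.0 -- the advantage peaks at a = mu
print(naf_advantage(np.zeros(2), mu, L))
</code></pre>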