---
title: 'Improved Q-learning with continuous actions'
summary: ''
difficulty: 2 # out of 3
---

Q-learning is one of the oldest and most general reinforcement learning (RL) algorithms. It works by estimating the long-term expected return for each state-action pair. Essentially, the goal of Q-learning is to fold the long-term outcome of each state-action pair into a single scalar that tells us how good that combination is; we can then maximize our reward by picking the action with the greatest Q-value. Q-learning is the basis of the DQN algorithm, which demonstrated that the combination of RL and deep learning is a fruitful one.
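As a concrete reference point, here is a minimal tabular sketch of the update just described; the environment size, learning rate, and discount factor are placeholder values chosen only for illustration:

```python
import numpy as np

# Minimal tabular sketch of the Q-learning update.
# State/action counts, learning rate, and discount factor are placeholders.
n_states, n_actions = 16, 4
alpha, gamma = 0.1, 0.99
Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next, done):
    """Move Q(s, a) toward the target r + gamma * max_a' Q(s', a')."""
    target = r if done else r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])

def greedy_action(s):
    """Act by picking the action with the greatest estimated Q-value."""
    return int(Q[s].argmax())
```

In DQN the table is replaced by a neural network trained toward the same target; the continuous-action setting addressed by this project adds the difficulty that the max over actions is no longer a simple lookup.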

Your goal is to create a robust Q-learning implementation that can solve all Gym environments with continuous action spaces without changing hyperparameters.
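One way to structure this is a single evaluation harness that runs the same agent, with one fixed hyperparameter set, across several continuous-action environments. The sketch below is only an assumed setup: the environment names and the `agent.act()` interface are placeholders, and it assumes the classic Gym API where `reset()` returns an observation and `step()` returns a 4-tuple.

```python
import gym

# Hypothetical benchmark harness: one agent, one fixed hyperparameter set,
# evaluated on several continuous-action (Box) environments.
ENV_NAMES = ["Pendulum-v1", "HalfCheetah-v3", "Hopper-v3"]  # illustrative choices

def evaluate(agent, env_name, episodes=10):
    env = gym.make(env_name)
    assert isinstance(env.action_space, gym.spaces.Box), "continuous actions only"
    returns = []
    for _ in range(episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            action = agent.act(obs)              # hypothetical agent interface
            obs, reward, done, _ = env.step(action)
            total += reward
        returns.append(total)
    return sum(returns) / len(returns)
```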

You may want to use the Normalized Advantage Function (NAF) model as a starting point. It is especially interesting to experiment with variants of the NAF model: for example, try it with a diagonal covariance. It can also be interesting to explore an advantage function that uses the maximum of several quadratics, which is convenient because its argmax over actions is easy to compute.
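For reference, here is a hedged sketch of what a NAF-style Q-function head might look like, including the diagonal-covariance variant mentioned above. The hidden size, the softplus parameterization of the positive-definite term, and the class name are illustrative assumptions, not the canonical NAF implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NAFHead(nn.Module):
    """Sketch of a NAF-style Q-function: Q(s, a) = V(s) + A(s, a), with a
    quadratic advantage A(s, a) = -0.5 (a - mu(s))^T P(s) (a - mu(s)).
    Hidden size and parameterizations are illustrative choices."""

    def __init__(self, obs_dim, act_dim, hidden=128, diagonal=False):
        super().__init__()
        self.act_dim, self.diagonal = act_dim, diagonal
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.value = nn.Linear(hidden, 1)        # V(s), the peak of the quadratic
        self.mu = nn.Linear(hidden, act_dim)     # mu(s), the argmax over actions
        # Parameters of P(s): a positive diagonal, or a lower-triangular
        # factor L(s) so that P = L L^T is positive definite.
        n_p = act_dim if diagonal else act_dim * (act_dim + 1) // 2
        self.p_params = nn.Linear(hidden, n_p)

    def forward(self, obs, act):
        h = self.trunk(obs)
        v, mu = self.value(h), self.mu(h)
        if self.diagonal:
            # Diagonal-covariance variant: P is diagonal with positive entries.
            P = torch.diag_embed(F.softplus(self.p_params(h)))
        else:
            entries = self.p_params(h)
            diag, off = entries[:, :self.act_dim], entries[:, self.act_dim:]
            lower = torch.zeros(obs.shape[0], self.act_dim, self.act_dim,
                                device=obs.device)
            rows, cols = torch.tril_indices(self.act_dim, self.act_dim, offset=-1)
            lower[:, rows, cols] = off
            L = lower + torch.diag_embed(F.softplus(diag))
            P = L @ L.transpose(-1, -2)
        diff = (act - mu).unsqueeze(-1)
        adv = -0.5 * (diff.transpose(-1, -2) @ P @ diff).squeeze(-1)
        return v + adv                            # Q(s, a), maximized at a = mu(s)
```

The max-of-several-quadratics variant could be built by instantiating several such heads and taking the elementwise maximum of their Q-values; since each quadratic peaks at its own mu(s) with value V(s), the overall argmax over actions is simply the mu of whichever head has the largest V(s).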


### Notes

This project is mainly concerned with reimplementing an existing algorithm. However, there is significant value in obtaining a very robust implementation, and there is a decent chance that new ideas will end up being required to get it working reliably across multiple tasks.