
---
title: 'Better sample efficiency for TRPO'
summary: ''
difficulty: 3 # out of 3
---

<p><a href="https://arxiv.org/pdf/1502.05477v4.pdf">Trust Region Policy Optimization (TRPO)</a> is a scalable implementation of a second-order <a href="http://www.scholarpedia.org/article/Policy_gradient_methods">policy gradient algorithm</a> that is highly effective on both continuous and discrete control problems. One of TRPO's strengths is that its hyperparameters are relatively easy to set: a hyperparameter setting that performs well on one task tends to perform well on many other tasks. But despite its significant advantages, the TRPO algorithm could be more data efficient.</p>

<p>The problem is to modify <a href="https://gym.openai.com/evaluations/eval_W27eCzLQBy60FciaSGSJw">a good TRPO implementation</a> so that it converges on all of Gym's <a href="https://gym.openai.com/envs#mujoco">MuJoCo environments</a> using 3x less experience, without a degradation in final average reward. Ideally, the new code should use the same hyperparameter setting for every problem.</p>

<p>This will be an impressive achievement, and the result will likely be scientifically significant.</p>

<p>When designing the code, you may find the following ideas useful:</p>
<ul>
  <li>Dynamically adjust the size of the mini-batch.</li>
  <li>Dynamically adjust the radius of the trust region.</li>
  <li>Use past samples with correctly-set importance weights (a minimal sketch appears after the notes below).</li>
</ul>

<hr>

<h3>Notes</h3>

<p>This problem is very hard, as getting an improvement of this magnitude is likely to require new ideas.</p>
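<p>As an illustration of the third idea above, the following is a minimal sketch of how past samples might be reused in TRPO's surrogate objective via importance weights. It is not drawn from any existing TRPO implementation; the function name, the plain-NumPy setup, and the truncation constant are all hypothetical placeholders.</p>

<pre><code>
# Minimal sketch (assumption: plain NumPy, independent of any TRPO codebase).
import numpy as np

def surrogate_loss_with_replay(logp_new, logp_behavior, advantages, clip=10.0):
    """Importance-weighted surrogate objective over replayed samples.

    logp_new      : log-prob of each action under the current policy
    logp_behavior : log-prob under the (older) policy that collected the data
    advantages    : advantage estimates computed when the data was collected
    clip          : cap on the importance ratio to keep the variance manageable
    """
    ratios = np.exp(logp_new - logp_behavior)   # pi_current / pi_behavior
    ratios = np.minimum(ratios, clip)           # truncate very large weights
    return np.mean(ratios * advantages)         # quantity to maximize

# Toy usage with random numbers standing in for a batch of replayed transitions.
rng = np.random.default_rng(0)
logp_behavior = rng.normal(-1.0, 0.3, size=1000)
logp_new = logp_behavior + rng.normal(0.0, 0.05, size=1000)
advantages = rng.normal(0.0, 1.0, size=1000)
print(surrogate_loss_with_replay(logp_new, logp_behavior, advantages))
</code></pre>

<p>Truncating the ratios is one simple way to trade a little bias for much lower variance when the replayed data comes from policies that have drifted far from the current one; how aggressively to truncate, and how old a sample can be before it should be discarded, are exactly the kinds of questions this request leaves open.</p>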