---
title: 'Better sample efficiency for TRPO'
summary: ''
difficulty: 3 # out of 3
---
<p><a href="https://arxiv.org/pdf/1502.05477v4.pdf">Trust Region Policy Optimization (TRPO)</a> is a scalable implementation of a
second-order <a href="http://www.scholarpedia.org/article/Policy_gradient_methods">policy gradient algorithm</a> that is highly effective on both continuous and discrete control problems. One of TRPO's strengths is that its hyperparameters are relatively easy to set: a setting that performs well on one task tends to perform well on many other tasks. But despite these significant advantages, the TRPO algorithm could be more data-efficient.
</p>
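<p>For reference, the update TRPO performs at each iteration is the constrained surrogate maximization from the paper (written here with the advantage function, as is common in practice): improve the importance-sampled advantage objective while keeping the new policy within a fixed KL radius delta of the old one. In LaTeX notation:</p>
<pre>
\max_\theta \; \mathbb{E}_{s,a \sim \pi_{\theta_{\mathrm{old}}}}
  \left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\mathrm{old}}}(a \mid s)} \, A^{\pi_{\theta_{\mathrm{old}}}}(s, a) \right]
\quad \text{subject to} \quad
\mathbb{E}_{s} \left[ D_{\mathrm{KL}}\!\left( \pi_{\theta_{\mathrm{old}}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s) \right) \right] \le \delta
</pre>
<p>The constrained problem is solved approximately with a natural-gradient step (conjugate gradient on Fisher-vector products) followed by a line search; the fixed KL radius is a large part of why the hyperparameters transfer so well across tasks.</p>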
<p>The problem is to
modify <a href="https://gym.openai.com/evaluations/eval_W27eCzLQBy60FciaSGSJw">a
good TRPO implementation</a> so that it converges on all of
Gym's <a href="https://gym.openai.com/envs#mujoco">MuJoCo
environments</a> using 3x less experience, without any degradation in
final average reward. Ideally, the new code should use the same
hyperparameter settings for every problem. </p>
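<p>To make "3x less experience" concrete, one natural metric is the total number of environment steps consumed before the average return over recent episodes reaches a per-environment target. The sketch below is not part of the problem statement: it assumes the classic gym API in which step returns (observation, reward, done, info), and the policy argument, environment id, and reward target are placeholders to be supplied by your own TRPO implementation.</p>
<pre>
import gym
from collections import deque

def steps_until_target(env_id, policy, target_return, max_steps=10_000_000, window=100):
    """Count every environment step consumed until the mean return over the
    last `window` episodes reaches `target_return` (or `max_steps` is hit)."""
    env = gym.make(env_id)
    returns = deque(maxlen=window)
    episode_return, obs = 0.0, env.reset()
    for step in range(1, max_steps + 1):
        action = policy(obs)                     # your TRPO policy acts here; a real
        obs, reward, done, _ = env.step(action)  # run would also apply TRPO updates
        episode_return += reward                 # between batches of rollouts
        if done:
            returns.append(episode_return)
            episode_return, obs = 0.0, env.reset()
            if len(returns) == window and sum(returns) / window >= target_return:
                return step                      # experience used to "converge"
    return max_steps                             # target was never reached

# Illustrative usage (environment id and target are examples only):
# used = steps_until_target('Hopper-v1', my_trpo_policy, target_return=3000)
</pre>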
<p> This would be an impressive achievement, and the result would likely
be scientifically significant.</p>
<p>When designing the code, you may find the following ideas useful:
<ul>
<li> Dynamically adjust the size of the mini-batch.</li>
<li> Dynamically adjust the radius of the trust region.</li>
<li> Use past samples with correctly-set importance weights (see the sketch after this list).</li>
</ul>
</p>
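<p>To illustrate the last idea: reusing past samples amounts to reweighting each stored transition by the probability ratio between the current policy and the (older) policy that generated it, so that the contribution of each old sample is corrected for the change in action probabilities (the mismatch in state distribution is typically ignored, as in the TRPO surrogate itself). The numpy sketch below shows the reweighted objective computed from stored log-probabilities; the function and variable names are ours, and in a real implementation the weights would typically be clipped, or stale data discarded once the policies diverge too far.</p>
<pre>
import numpy as np

def off_policy_surrogate(logp_current, logp_behavior, advantages):
    """Importance-weighted surrogate for samples drawn from an older
    (behavior) policy: each sample is reweighted by
    w = pi_current(a|s) / pi_behavior(a|s), computed from stored log-probs."""
    weights = np.exp(logp_current - logp_behavior)
    return np.mean(weights * advantages)

def effective_sample_size(logp_current, logp_behavior):
    """Kish effective sample size of the reweighted batch: close to the batch
    size means the old data is still informative; much smaller means the
    policies have diverged and the samples are best discarded."""
    w = np.exp(logp_current - logp_behavior)
    return np.sum(w) ** 2 / np.sum(w ** 2)

# Toy batch of 5 stored transitions:
logp_behavior = np.log(np.array([0.20, 0.10, 0.40, 0.30, 0.25]))
logp_current  = np.log(np.array([0.25, 0.10, 0.30, 0.35, 0.20]))
advantages    = np.array([1.0, -0.5, 0.3, 2.0, -1.2])
print(off_policy_surrogate(logp_current, logp_behavior, advantages))   # reweighted objective
print(effective_sample_size(logp_current, logp_behavior))              # between 1 and 5
</pre>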
<hr>
<h3>Notes</h3>
<p>This problem is very hard, as getting an improvement of this magnitude is likely to require new ideas.</p>