[Chapter 6] Reinforcement Learning (4) Policy Search

In the previous sections, we tried to learn the utility function or, more commonly, the action-value function (Q-function), and then greedily selected the action with the highest Q-value:

\pi(s)=\arg\max_a Q(s,a)

This means that once we have learned the Q-function well, we can read off an optimal policy from it. All the methods so far therefore learn the Q-function, directly or indirectly. Policy search, in contrast, tries to update the policy function directly.
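As a reminder of what the greedy rule does, here is a minimal sketch. The Q-table, state names, and action names are made up for illustration; they are not from any earlier section.

```python
# Hypothetical Q-table for illustration: Q[state][action] -> estimated value.
Q = {
    "s0": {"left": 0.2, "right": 0.7},
    "s1": {"left": 0.5, "right": 0.1},
}

def greedy_policy(q_table, state):
    """Return the action with the highest Q-value in `state` (arg max over a)."""
    actions = q_table[state]
    return max(actions, key=actions.get)

print(greedy_policy(Q, "s0"))  # right
print(greedy_policy(Q, "s1"))  # left
```

The policy here is fully determined by the Q-table, which is why the earlier methods only had to learn Q.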

Policy Search

Using function approximation, we can write the policy function as:

\pi(s)=\arg\max_a \hat{Q}(s,a)

As a mapping from states to actions, the policy is itself a function with parameters \theta to learn. Policy search adjusts \theta to improve the policy directly, without first approximating the Q-values or utilities.

However, the formula above raises two main problems that we need to solve first:

  • The arg max operation is not differentiable, which makes gradient-based search difficult.
  • In an environment with discrete actions, the output of the policy function is discrete, so the policy is not a continuous function of \theta.

In fact, one method solves both problems: treat action selection as a classification problem. Why? When the agent selects an action, it picks the action with the highest Q-value in the current state; in a classification problem, the model predicts a probability for each class and outputs the class with the highest probability. The two are essentially the same thing. Remember how we solve the classification problem? With the softmax function, and we can use it here too:

\pi_{\theta}(s,a)=\frac{e^{\hat{Q}_{\theta}(s,a)}}{\sum_{a'} e^{\hat{Q}_{\theta}(s,a')}}

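Given a state s, the model outputs a probability for each action, and the action with the highest estimated Q-value is the most likely to be executed. As long as \hat{Q}_{\theta} is differentiable in \theta, these probabilities are differentiable too, which makes gradient-based search possible.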
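The softmax policy can be sketched in a few lines of plain Python. The three Q-value estimates below are hypothetical, and the helper names `softmax_policy` and `sample_action` are chosen only for this example.

```python
import math
import random

def softmax_policy(q_values):
    """Turn a list of Q-value estimates into action probabilities."""
    m = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp(q - m) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(probs, rng=random):
    """Draw an action index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

q_hat = [1.0, 2.0, 0.5]  # hypothetical estimates of Q(s, a) for three actions
probs = softmax_policy(q_hat)
action = sample_action(probs)
print(probs, action)
```

Unlike arg max, this policy is stochastic: the action with the largest estimate is only the most probable one, and every action keeps a nonzero probability.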

Using gradient ascent, we get the parameter update formula:

\theta_{j+1}=\theta_j+\alpha G_j \frac{\nabla_{\theta} \pi_{\theta}(s,a_j)}{\pi_{\theta}(s,a_j)}

where \alpha is the learning rate, a_j is the action taken in state s on trial j, and G_j is the return observed from that point on. Since \nabla_{\theta} \ln x = \nabla_{\theta} x / x, the same update can be written in logarithmic form:

\theta_{j+1}=\theta_j+\alpha G_j \nabla_{\theta} \ln \pi_{\theta}(s,a_j)
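To make the log-form update concrete, here is a minimal sketch under simplifying assumptions: a single state, a softmax over one learnable preference per action (so \theta is just a list of preferences), and a made-up reward where action 1 pays 1.0 and action 0 pays nothing. For this parameterization, the gradient of ln \pi_{\theta}(s,a) with respect to the preference of action b is 1 - \pi_{\theta}(s,b) if b = a and -\pi_{\theta}(s,b) otherwise.

```python
import math
import random

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_pi(theta, action):
    """Gradient of ln pi_theta(s, a) w.r.t. theta for a softmax over preferences."""
    probs = softmax(theta)
    return [(1.0 if b == action else 0.0) - probs[b] for b in range(len(theta))]

def policy_gradient_step(theta, action, G, alpha):
    """theta <- theta + alpha * G * grad ln pi_theta(s, a)."""
    grad = grad_log_pi(theta, action)
    return [t + alpha * G * g for t, g in zip(theta, grad)]

# Toy single-state problem (an assumption for illustration):
# action 1 pays 1.0, action 0 pays 0.0.
rng = random.Random(0)
theta = [0.0, 0.0]
for _ in range(500):
    probs = softmax(theta)
    a = rng.choices([0, 1], weights=probs, k=1)[0]
    G = 1.0 if a == 1 else 0.0
    theta = policy_gradient_step(theta, a, G, alpha=0.1)

print(softmax(theta))  # the probability of action 1 should end up close to 1
```

Each time the rewarded action is sampled, its preference rises and the other falls, so the policy shifts toward the better action without any Q-table being learned.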

Variance Reduction using a Baseline

Another technique is to use a baseline to reduce the variance of the gradient estimate. In the update above, the return G_j acts as a sample of Q_{\pi_{\theta}}(s,a_j); we replace Q_{\pi_{\theta}}(s,a) with Q_{\pi_{\theta}}(s,a)-B(s). A natural choice for the baseline is V_{\pi_{\theta}}(s), which gives the advantage function:

A_{\pi_{\theta}}(s,a)=Q_{\pi_{\theta}}(s,a)-V_{\pi_{\theta}}(s)
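A small numeric sketch may help here. It uses the standard identity V_{\pi}(s)=\sum_a \pi(s,a) Q_{\pi}(s,a); the probabilities and Q-values are hypothetical numbers picked so the arithmetic is easy to check by hand.

```python
def state_value(probs, q_values):
    """V_pi(s) = sum_a pi(s, a) * Q_pi(s, a)."""
    return sum(p * q for p, q in zip(probs, q_values))

def advantages(probs, q_values):
    """A_pi(s, a) = Q_pi(s, a) - V_pi(s) for every action a."""
    v = state_value(probs, q_values)
    return [q - v for q in q_values]

probs = [0.5, 0.25, 0.25]     # hypothetical pi(s, .)
q_values = [10.0, 12.0, 8.0]  # hypothetical Q_pi(s, .)
adv = advantages(probs, q_values)
print(adv)  # [0.0, 2.0, -2.0]
```

The Q-values are all around 10, but the advantages are centered on zero (their expectation under the policy is exactly zero), so the update is scaled by how much better or worse an action is than average rather than by the raw magnitude of the return.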

Actor Critic

The actor-critic algorithm combines Q-function-based learning with policy search. It has two outputs: one, called the actor, learns a policy that selects actions; the other, called the critic, learns a value function or Q-function that is used only for evaluation. This splits policy evaluation and policy improvement into two parts, which are executed alternately.
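The alternation between evaluation and improvement can be sketched as follows. This is one common variant (a one-step update where the critic's TD error stands in for the return), assumed here for illustration rather than taken from the text; the tabular representation, step sizes, and toy task are all hypothetical.

```python
import math
import random

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def actor_critic_step(theta, V, s, a, r, s_next, done,
                      alpha_actor=0.1, alpha_critic=0.1, gamma=0.9):
    """One update from a single transition (s, a, r, s_next)."""
    # Critic: evaluate the transition with the TD error.
    target = r + (0.0 if done else gamma * V[s_next])
    td_error = target - V[s]
    V[s] += alpha_critic * td_error
    # Actor: improve the policy, using the TD error in place of the return.
    probs = softmax(theta[s])
    for b in range(len(theta[s])):
        grad = (1.0 if b == a else 0.0) - probs[b]
        theta[s][b] += alpha_actor * td_error * grad
    return td_error

# Toy one-state episodic task (hypothetical): action 1 pays 1.0, action 0 pays 0.0.
rng = random.Random(0)
theta = {"s": [0.0, 0.0]}  # actor: action preferences per state
V = {"s": 0.0}             # critic: state-value estimates
for _ in range(2000):
    a = rng.choices([0, 1], weights=softmax(theta["s"]), k=1)[0]
    r = 1.0 if a == 1 else 0.0
    actor_critic_step(theta, V, "s", a, r, None, done=True)

print(softmax(theta["s"]), V["s"])
```

The critic never chooses actions; it only supplies the evaluation signal that scales the actor's gradient step.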

In deep reinforcement learning (DRL), to save memory and training time, the two parts usually share the bottom layers, which extract features, and the network splits into separate actor and critic branches at a higher layer.
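The shared-layer layout can be sketched as a forward pass in plain Python. The class name `SharedActorCritic`, the layer sizes, the tanh activation, and the random initialization are all assumptions for illustration; a real DRL model would be built in a deep learning framework and trained, which this sketch does not do.

```python
import math
import random

def linear(weights, bias, x):
    """Dense layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def init(rows, cols, rng):
    """Small random weight matrix with `rows` outputs and `cols` inputs."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

class SharedActorCritic:
    """Hypothetical two-headed network: a shared trunk, then actor and critic heads."""

    def __init__(self, n_inputs, n_hidden, n_actions, seed=0):
        rng = random.Random(seed)
        self.trunk_w, self.trunk_b = init(n_hidden, n_inputs, rng), [0.0] * n_hidden
        self.actor_w, self.actor_b = init(n_actions, n_hidden, rng), [0.0] * n_actions
        self.critic_w, self.critic_b = init(1, n_hidden, rng), [0.0]

    def forward(self, state):
        # Shared feature extraction (computed once, used by both heads).
        features = [math.tanh(h) for h in linear(self.trunk_w, self.trunk_b, state)]
        # Actor head: softmax over action scores.
        logits = linear(self.actor_w, self.actor_b, features)
        m = max(logits)
        exps = [math.exp(z - m) for z in logits]
        total = sum(exps)
        probs = [e / total for e in exps]
        # Critic head: a single scalar value estimate.
        value = linear(self.critic_w, self.critic_b, features)[0]
        return probs, value

net = SharedActorCritic(n_inputs=4, n_hidden=8, n_actions=2)
probs, value = net.forward([0.1, -0.2, 0.3, 0.0])
print(probs, value)
```

Only the trunk's parameters are shared; each head has its own small set of weights, which is where the memory and compute savings come from.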
