Reinforcement Learning Notes: Classic Papers on Policy Gradients

Model-Free RL: Policy Gradients

1. TRPO

2015: Trust Region Policy Optimization

By introducing a constraint on the relative entropy (KL divergence) between the old policy and the new policy, the paper improves on the natural policy gradient algorithm: it takes the largest update step it can while still guaranteeing monotonic improvement.
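For reference, the constrained surrogate problem that TRPO approximately solves at each iteration can be written as follows (my own summary, using the same notation as the PPO section below; \delta is the trust-region size):

\max_{\theta} ~ \hat{\mathbb{E}} \left[ \frac{\pi_{\theta}(a_t | s_t)}{\pi_{\theta_{old}}(a_t | s_t)} \hat{A}_t \right] ~~ \text{s.t.} ~~ \hat{\mathbb{E}} \left[ KL \left[ \pi_{\theta_{old}}(\cdot | s_t), \pi_{\theta}(\cdot | s_t) \right] \right] \le \delta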

[Pseudocode for TRPO, from the Spinning Up docs]

2. GAE

2015: High-Dimensional Continuous Control Using Generalized Advantage Estimation

The contributions of this paper are summarized as follows:

  • We provide justification and intuition for an effective variance reduction scheme for policy gradients, which we call generalized advantage estimation (GAE). While the formula has been proposed in prior work, our analysis is novel and enables GAE to be applied with a more general set of algorithms, including the batch trust-region algorithm we use for our experiments.
  • We propose the use of a trust region optimization method for the value function, which we find is a robust and efficient way to train neural network value functions with thousands of parameters.

Note: A^{\pi, \gamma}(s_t, a_t) in the expression above is unknown and has to be estimated. Unlike the usual advantage estimators, GAE proposes an exponentially weighted estimator of the advantage function, analogous to TD(\lambda) (biased, but not too biased), as follows:

Remarkably, the GAE(\gamma, \lambda) defined above simplifies to a very clean form:
\hat{A}_t^{GAE(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}^V, ~\text{where}~ \delta_t^V = r_t + \gamma V(s_{t+1}) - V(s_t)
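A minimal sketch of how this estimator is typically computed from a trajectory of rewards and value predictions (function and variable names are my own, not from the paper); the backward recursion \hat{A}_t = \delta_t^V + \gamma\lambda \hat{A}_{t+1} avoids the infinite sum:

```python
import numpy as np

def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one (non-terminating) trajectory.

    rewards:    array of shape (T,)  -- r_t
    values:     array of shape (T,)  -- V(s_t)
    last_value: scalar               -- V(s_T), bootstrap value after the last step
    Returns advantages of shape (T,), using A_t = delta_t + gamma * lam * A_{t+1}.
    """
    T = len(rewards)
    values_ext = np.append(values, last_value)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values_ext[t + 1] - values_ext[t]  # delta_t^V
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```

For episodic tasks one would additionally zero the bootstrap term at terminal steps (done masks), which is omitted here for brevity.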

For value function estimation, the paper proposes using a trust region method:
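Roughly, the value network is fit by a constrained regression that keeps the new value function close to the old one (here \sigma^2 is approximately the mean squared error of the previous value function and \epsilon a small trust-region constant; see the paper for the exact formulation):

\min_{\phi} \sum_{n} \| V_{\phi}(s_n) - \hat{V}_n \|^2 ~~ \text{s.t.} ~~ \frac{1}{N} \sum_{n} \frac{\| V_{\phi}(s_n) - V_{\phi_{old}}(s_n) \|^2}{2 \sigma^2} \le \epsilon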


3. A3C

2016: Asynchronous Methods for Deep Reinforcement Learning

The paper proposes a conceptually simple and lightweight DRL framework that optimizes with asynchronous gradient descent.

Instead of experience replay, we asynchronously execute multiple agents in parallel, on multiple instances of the environment. This parallelism also decorrelates the agents’ data into a more stationary process, since at any given time-step the parallel agents will be experiencing a variety of different states. Hence, we do not use a replay memory and rely on parallel actors employing different exploration policies to perform the stabilizing role undertaken by experience replay in the DQN training algorithm.
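The overall training layout can be sketched as follows (a toy illustration with made-up names, not the paper's implementation; Python threads only show the control flow, whereas the paper runs workers on separate CPU threads for real parallelism):

```python
import threading
import numpy as np

# Toy sketch of the A3C layout: several workers, each with its own environment
# instance and exploration seed, asynchronously apply gradients to one shared
# parameter vector -- no replay buffer anywhere.

shared_params = np.zeros(8)  # shared actor-critic parameters (toy stand-in)

def fake_rollout_gradient(rng):
    """Stand-in for collecting a short rollout and computing the A3C loss gradient."""
    return rng.normal(size=shared_params.shape) * 0.01

def worker(worker_id, n_updates=200, lr=0.1):
    rng = np.random.default_rng(worker_id)  # different exploration per worker
    for _ in range(n_updates):
        grad = fake_rollout_gradient(rng)
        shared_params[:] = shared_params - lr * grad  # Hogwild-style async update

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```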


4. ACER

2016: Sample Efficient Actor-Critic with Experience Replay


5. ACKTR

2017: Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation


6. PPO

2017: Proximal Policy Optimization Algorithms

PPO's motivation is similar to TRPO's: take the largest possible policy improvement step with the data currently available, without accidentally causing performance collapse. PPO matches TRPO's data efficiency and reliable performance while using only first-order optimization.
The paper's overview of policy gradient methods, trust region methods, and surrogate objectives is quite concise and worth reading!
PPO comes in two variants: PPO-Penalty and PPO-Clip.

  • PPO-Penalty replaces TRPO's hard KL constraint with a penalty term (with \beta adjusted adaptively):
    L^{KLPEN}(\theta) = \hat{\mathbb{E}} \left[ \frac{\pi_{\theta}(a_t | s_t)}{\pi_{\theta_{old}}(a_t | s_t)} \hat{A}_t - \beta KL \left[ \pi_{\theta_{old}}(\cdot | s_t), \pi_{\theta}(\cdot | s_t) \right] \right]
  • PPO-Clip constrains the new policy via a clipped surrogate objective (see the code sketch after this list):
    L^{CLIP}(\theta) = \hat{\mathbb{E}} \left[ \min (r_t(\theta)\hat{A}_t, clip(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t) \right], ~\text{where} ~r_t(\theta) = \frac{\pi_{\theta}(a_t | s_t)}{\pi_{\theta_{old}}(a_t | s_t)}
  • In the practical implementation, the authors augment the objective with a value function error term and an entropy regularization term:
    L^{CLIP+VF+S}_t(\theta) = \hat{\mathbb{E}} \left[ L^{CLIP}_t(\theta) - c_1 L^{VF}_t(\theta) + c_2 S[\pi_{\theta}] (s_t) \right]
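A minimal PyTorch sketch of the clipped term (function name and argument layout are my own; the full loss would additionally subtract c_1 L^{VF} and add the entropy bonus as above):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective L^{CLIP} (to be maximized; negate it if
    feeding an optimizer that minimizes).

    logp_new:   log pi_theta(a_t|s_t) under the current policy (requires grad)
    logp_old:   log pi_theta_old(a_t|s_t) recorded at rollout time (no grad)
    advantages: advantage estimates \hat{A}_t (e.g. from GAE), no grad
    """
    ratio = torch.exp(logp_new - logp_old)                        # r_t(theta)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return torch.min(ratio * advantages, clipped * advantages).mean()
```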

7. SAC

2018: Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

The paper proposes an off-policy maximum entropy actor-critic algorithm in which the actor maximizes expected return while also maximizing entropy; this improves both sample efficiency and stability.
The Spinning Up documentation explains SAC in great detail and is worth revisiting repeatedly (its implementation is simpler than the one in the paper and does not use a state value function) [link]
In SAC, the Q-functions are learned in a similar way to TD3, but with a few key differences.

# First, what’s similar?

  • Like in TD3, both Q-functions are learned with MSBE minimization, by regressing to a single shared target.
  • Like in TD3, the shared target is computed using target Q-networks, and the target Q-networks are obtained by polyak averaging the Q-network parameters over the course of training.
  • Like in TD3, the shared target makes use of the clipped double-Q trick.

# What’s different?

  • Unlike in TD3, the target also includes a term that comes from SAC’s use of entropy regularization.
  • Unlike in TD3, the next-state actions used in the target come from the current policy instead of a target policy.
  • Unlike in TD3, there is no explicit target policy smoothing. TD3 trains a deterministic policy, and so it accomplishes smoothing by adding random noise to the next-state actions. SAC trains a stochastic policy, and so the noise from that stochasticity is sufficient to get a similar effect.

The implementation in the paper uses a state value function V_{\psi}(s_t), a soft Q-function Q_{\theta}(s_t, a_t), and a policy \pi_{\phi}(a_t|s_t).
The main difference between SAC and TD3 is that SAC adds entropy regularization both to the Q-function targets and to the policy optimization objective.
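Putting the bullet points above together, the shared target both Q-functions regress to is y = r + \gamma (1-d) \left( \min_{i} Q_{targ, i}(s', a') - \alpha \log \pi_{\phi}(a' | s') \right), with a' sampled from the current policy. A hedged PyTorch sketch (interfaces such as policy(next_obs) returning an action and its log-prob are my own assumptions):

```python
import torch

def sac_q_target(rewards, dones, next_obs, policy, q1_targ, q2_targ,
                 alpha=0.2, gamma=0.99):
    """Entropy-regularized Bellman target for both Q-functions (Spinning Up style).

    policy(next_obs) is assumed to return (next_actions, logp_next) sampled from
    the *current* stochastic policy; q1_targ/q2_targ are polyak-averaged target nets.
    """
    with torch.no_grad():
        next_actions, logp_next = policy(next_obs)              # a' ~ pi(.|s')
        q_min = torch.min(q1_targ(next_obs, next_actions),
                          q2_targ(next_obs, next_actions))      # clipped double-Q
        target = rewards + gamma * (1.0 - dones) * (q_min - alpha * logp_next)
    return target
```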

There is no need in principle to include a separate function approximator for the state value, since it is related to the Q-function and policy according to Equation 3. This quantity can be estimated from a single action sample from the current policy without introducing a bias, but in practice, including a separate function approximator for the soft value can stabilize training and is convenient to train simultaneously with the other networks.
# The state value function V_{\psi}(s_t) is not strictly necessary, but using it can stabilize training.

[Pseudocode for SAC, from the Spinning Up docs]

Model-Free RL: Deterministic Policy Gradients

1. DPG

2014: Deterministic Policy Gradient Algorithms

See the paper for details; it is dense and quite theoretical.


2. DDPG

2015: Continuous Control With Deep Reinforcement Learning

DDPG can be viewed as a combination of DQN and DPG: it uses deep Q-learning to learn the Q-function (critic), and uses the learned Q-function together with the policy gradient to learn the policy (actor). Overall, it is an extension of DQN to continuous action spaces.
Following the Off-Policy Deterministic Actor-Critic result in DPG, the gradient for the actor policy update is:
\begin{eqnarray} \nabla_{\theta^{\mu}} J &\approx& \mathbb{E}_{s_t \sim \rho^{\beta}} \left[\nabla_{\theta^{\mu}} Q(s, a|\theta^Q) |_{s=s_t, a=\mu(s_t | \theta^{\mu})} \right] \\ &=& \mathbb{E}_{s_t \sim \rho^{\beta}} \left[\nabla_a Q(s, a|\theta^Q) |_{s=s_t, a=\mu(s_t)} \nabla_{\theta^{\mu}} \mu(s|\theta^{\mu}) |_{s=s_t} \right] \end{eqnarray}
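In code, this gradient is usually realized implicitly by backpropagating through the critic; a minimal PyTorch sketch (actor and critic are assumed to be ordinary nn.Module callables, names mine):

```python
import torch

def ddpg_actor_loss(obs, actor, critic):
    """Deterministic policy gradient step: ascend Q(s, mu(s)) w.r.t. the actor's
    parameters, i.e. minimize -Q(s, mu(s)); the chain rule in the equation above
    is handled by autograd."""
    actions = actor(obs)                 # a = mu(s | theta^mu)
    return -critic(obs, actions).mean()  # maximize Q  <=>  minimize -Q
```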


3. TD3

2018: Addressing Function Approximation Error in Actor-Critic Methods

TD3 (Twin Delayed DDPG) aims to address the overestimation bias in DDPG, i.e. the value overestimation problem of actor-critic methods, as well as the problem of high-variance estimates.
It relies on the following three tricks (a sketch of the resulting target computation follows the list):

  • Trick One: Clipped Double-Q Learning. TD3 learns two Q-functions instead of one (hence “twin”), and uses the smaller of the two Q-values to form the targets in the Bellman error loss functions.
  • Trick Two: Delayed Policy Updates. TD3 updates the policy (and target networks) less frequently than the Q-function. The paper recommends one policy update for every two Q-function updates.
  • Trick Three: Target Policy Smoothing. TD3 adds noise to the target action, to make it harder for the policy to exploit Q-function errors by smoothing out Q along changes in action.
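A hedged PyTorch sketch of how the first and third tricks combine in the Bellman target (the delayed policy update, Trick 2, is just a step-counter check in the training loop; all names and interfaces here are my own):

```python
import torch

def td3_q_target(rewards, dones, next_obs, actor_targ, q1_targ, q2_targ,
                 gamma=0.99, noise_std=0.2, noise_clip=0.5, act_limit=1.0):
    """Bellman target combining target policy smoothing (Trick 3) and
    clipped double-Q learning (Trick 1)."""
    with torch.no_grad():
        next_actions = actor_targ(next_obs)                       # mu_targ(s')
        noise = (torch.randn_like(next_actions) * noise_std).clamp(-noise_clip, noise_clip)
        next_actions = (next_actions + noise).clamp(-act_limit, act_limit)
        q_min = torch.min(q1_targ(next_obs, next_actions),
                          q2_targ(next_obs, next_actions))        # min of the twin Qs
        target = rewards + gamma * (1.0 - dones) * q_min
    return target
```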
[Pseudocode for TD3, from the Spinning Up docs]
