Reinforcement Learning Notes: Classic Papers - Miscellaneous

# Model-Free RL: Distributional RL

1. C51 (Categorical DQN)

2017: A Distributional Perspective on Reinforcement Learning

Traditional RL models the expected return (i.e. learns a value function). This paper instead proposes modeling the distribution of the random return (i.e. learning a value distribution).
The connection between the two: the expectation of the random return's distribution is exactly the expected return.

the main object of our study is the random return Z whose expectation is the value Q.

Comparison of the Bellman equations for the value function and the value distribution:
Q(x, a) = \mathbb{E} R(x, a) + \gamma \mathbb{E} Q(X', A') \\ Z(x, a) \overset{D}{=} R(x, a) + \gamma Z(X', A')

After a series of theoretical analyses, the authors propose modeling the value distribution with a discrete distribution over a fixed set of atoms:
z_i = V_{MIN} + i \Delta z, \quad \Delta z := \frac{V_{MAX} - V_{MIN}}{N - 1}, \quad 0 \le i < N \\ Z_{\theta}(x, a) = z_i \quad \text{w.p.} \quad p_i(x, a) := \frac{e^{\theta_i(x, a)}}{\sum_j e^{\theta_j(x, a)}}

The paper proposes projecting the sample Bellman update \hat{\mathcal{T}} Z_{\theta} onto the support of Z_\theta, which effectively turns the Bellman update into a multiclass classification problem; the paper illustrates one step of the update under the distributional Bellman operator, where \Phi \hat{\mathcal{T}} Z_{\hat{\theta}}(x, a) denotes the projected update.

The training loss \mathcal{L}_{x, a}(\theta) is the KL divergence D_{KL} (\Phi \hat{\mathcal{T}} Z_{\hat{\theta}}(x, a) \| Z_\theta(x, a)), where:
\hat{\mathcal{T}} z_j := r + \gamma z_j

The algorithm built on top of this is called the Categorical Algorithm (a sketch of the projection and loss follows).
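A minimal NumPy sketch of the projection \Phi and the resulting cross-entropy loss (equal to the KL term above up to a constant); the function and argument names are my own, not the paper's:

```python
import numpy as np

def categorical_projection(rewards, next_probs, z, gamma, v_min, v_max):
    """Project the target distribution (atoms r + gamma * z_j with probabilities
    next_probs) back onto the fixed support z, as in the C51 projection Phi."""
    batch, n_atoms = next_probs.shape
    delta_z = (v_max - v_min) / (n_atoms - 1)

    # T_hat z_j = r + gamma * z_j, clipped to [V_MIN, V_MAX]
    tz = np.clip(rewards[:, None] + gamma * z[None, :], v_min, v_max)

    # fractional index of each projected atom on the support
    b = (tz - v_min) / delta_z
    lower = np.floor(b).astype(int)
    upper = np.ceil(b).astype(int)

    projected = np.zeros_like(next_probs)
    for i in range(batch):
        for j in range(n_atoms):
            l, u = lower[i, j], upper[i, j]
            if l == u:                       # landed exactly on an atom
                projected[i, l] += next_probs[i, j]
            else:                            # split mass between neighbours
                projected[i, l] += next_probs[i, j] * (u - b[i, j])
                projected[i, u] += next_probs[i, j] * (b[i, j] - l)
    return projected

def categorical_loss(projected, pred_probs, eps=1e-8):
    """Cross-entropy between the projected target and Z_theta(x, a)."""
    return -np.sum(projected * np.log(pred_probs + eps), axis=1)
```

The projected distribution is treated as a fixed target, so minimizing this cross-entropy is equivalent to minimizing the KL divergence above.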

When the number of atoms is N = 51, the Categorical Algorithm works particularly well empirically, hence the name C51.

As in DQN, we use a simple \epsilon-greedy policy over the expected action-values; we leave as future work the many ways in which an agent could select actions on the basis of the full distribution.


2. QR-DQN

2017: Distributional Reinforcement Learning with Quantile Regression

C51 first performs a heuristic projection step and then minimizes the KL divergence between the projected Bellman update and the prediction. What remains is a sizable gap between the Wasserstein-metric theory and the actual algorithm.
This paper proposes using quantile regression to carry out distributional RL in a way that is more faithful to that theory.

The paper replaces C51's parameterized categorical distribution with a parameterized quantile distribution: a uniform mixture of N Diracs whose locations \theta_i(x, a) estimate the quantile values at the midpoints \hat{\tau}_i = \frac{\tau_{i-1} + \tau_i}{2}, \tau_i = i/N:
Z_{\theta}(x, a) := \frac{1}{N} \sum_{i=1}^{N} \delta_{\theta_i(x, a)}

Compared with the original parameterization (C51), the parameterized quantile distribution has three benefits:

  • First, we are not restricted to prespecified bounds on the support, or a uniform resolution, potentially leading to significantly more accurate predictions when the range of returns vary greatly across states.
  • This also lets us do away with the unwieldy projection step present in C51, as there are no issues of disjoint supports. Together, these obviate the need for domain knowledge about the bounds of the return distribution when applying the algorithm to new tasks.
  • Finally, this reparametrization allows us to minimize the Wasserstein loss, without suffering from biased gradients, specifically, using quantile regression.

The quantile regression loss is defined as:
\mathcal{L}_{QR}^{\tau}(\theta) := \mathbb{E}_{\hat{Z} \sim Z} [ \rho_{\tau} (\hat{Z} - \theta)], \quad \text{where} \quad \rho_{\tau} (u) = u(\tau - \delta_{\{u < 0\}}), \ \forall u \in \mathbb{R}

Because the quantile regression loss is not smooth at zero, the paper uses a modified quantile Huber loss, which acts as an asymmetric squared loss on the interval [-k, k] around zero and reverts to the standard quantile loss outside of it.
The Huber loss is defined as:
\mathcal{L}_k (u) = \begin{cases} \frac{1}{2}u^2, & \text{ if } |u| \leq k \\ k(|u| - \frac{1}{2}k), & \text{ otherwise } \end{cases}

The quantile Huber loss is defined as the asymmetric variant of the Huber loss:
\rho_{\tau}^k (u) = |\tau - \delta_{\{u<0\}}| \mathcal{L}_k (u)
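A minimal NumPy sketch of \rho_{\tau}^k (function and argument names are mine):

```python
import numpy as np

def quantile_huber_loss(u, tau, k=1.0):
    """Quantile Huber loss rho_tau^k(u).

    u:   TD errors, any shape
    tau: quantile fractions in (0, 1), broadcastable against u
    k:   Huber threshold kappa
    """
    abs_u = np.abs(u)
    # standard Huber loss L_k(u)
    huber = np.where(abs_u <= k, 0.5 * u ** 2, k * (abs_u - 0.5 * k))
    # asymmetric weighting |tau - 1{u < 0}|
    return np.abs(tau - (u < 0).astype(float)) * huber
```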

The resulting algorithm is Quantile Regression Q-learning (QR-DQN); a sketch of its per-transition loss follows.
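A minimal NumPy sketch of the QR-DQN loss for one transition, assuming the greedy action a^* = \arg\max_a \frac{1}{N}\sum_j \theta_j(x', a) has already been selected (names are mine):

```python
import numpy as np

def qr_dqn_loss(theta_pred, theta_target, r, gamma, k=1.0):
    """QR-DQN loss sketch for a single transition.

    theta_pred:   (N,) quantile estimates theta_i(x, a)
    theta_target: (N,) target quantile estimates theta_j(x', a*) for the greedy a*
    """
    n = theta_pred.shape[0]
    tau_hat = (np.arange(n) + 0.5) / n                     # quantile midpoints
    # pairwise TD errors (T theta)_j - theta_i, shape (N, N)
    u = r + gamma * theta_target[None, :] - theta_pred[:, None]
    abs_u = np.abs(u)
    huber = np.where(abs_u <= k, 0.5 * u ** 2, k * (abs_u - 0.5 * k))
    rho = np.abs(tau_hat[:, None] - (u < 0).astype(float)) * huber
    # average over target samples j, sum over quantiles i
    return rho.mean(axis=1).sum()
```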


3. IQN

2018: Implicit Quantile Networks for Distributional Reinforcement Learning

IQN can be viewed as a distributional generalization of DQN that combines the strengths of C51 and QR-DQN.
QR-DQN learns a discrete set of quantiles, whereas IQN aims to learn the full quantile function, i.e. a continuous mapping from probabilities to returns. Combined with a base distribution such as U([0, 1]), this yields an implicit network that can approximate any return distribution, limited only by the network's capacity.

IQN has three main advantages:

  • First, the approximation error for the distribution is no longer controlled by the number of quantiles output by the network, but by the size of the network itself, and the amount of training.
  • Second, IQN can be used with as few, or as many, samples per update as desired, providing improved data efficiency with increasing number of samples per training update.
  • Third, the implicit representation of the return distribution allows us to expand the class of policies to more fully take advantage of the learned distribution. Specifically, by taking the base distribution to be non-uniform, we expand the class of policies to \epsilon-greedy policies on arbitrary distortion risk measures.

The quantile function is simply the inverse cumulative distribution function, so its domain is [0, 1] and its range is the set of values the random variable can take.

The paper first recalls the key equations of QR-DQN (the quantile-midpoint parameterization and the quantile Huber loss summarized above) and then builds on them.

IQN is a trained deterministic parametric function that reparameterizes samples from a base distribution (e.g. \tau \sim U([0, 1])) into the corresponding quantile values of the target return distribution.
Define F_Z^{-1}(\tau) as the quantile function of the random variable Z at \tau \in [0, 1], abbreviated Z_{\tau} := F_Z^{-1}(\tau); define \beta : [0, 1] \rightarrow [0, 1] as a distortion risk measure.
The distorted expectation of Z(x, a) under \beta is defined as:

Q_{\beta}(x, a) := \mathop{\mathbb{E}}_{\tau \sim U([0, 1])} [Z_{\beta(\tau)}(x, a)] = \int_{0}^{1}F_Z^{-1}(\tau) d \beta(\tau)
Any distorted expectation can be represented as a weighted sum over the quantiles!
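A concrete example of a distortion risk measure used in the paper is CVaR, for which
\beta(\tau) = \text{CVaR}(\eta, \tau) := \eta \tau, \quad \eta \in (0, 1]
so that Q_{\beta} averages only the lowest \eta-fraction of the return distribution, yielding a risk-averse policy.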

The risk-sensitive greedy policy \pi_{\beta} is:
\pi_{\beta}(x) = \mathop{argmax}_{a \in \mathcal{A}} Q_{\beta}(x, a)

For two samples \tau, \tau^{\prime} \sim U([0, 1]) and the policy \pi_{\beta}, the sampled TD error at step t is:
\delta_t^{\tau, \tau^{\prime}} = r_t + \gamma Z_{\tau^{'}} (x_{t+1}, \pi_{\beta}(x_{t+1})) - Z_{\tau}(x_t, a_t)

The IQN loss is as follows, where N and N^{'} denote the numbers of i.i.d. samples \tau_i, \tau_j^{'} \sim U([0, 1]) respectively:
\mathcal{L} (x_t, a_t, r_t, x_{t+1}) = \frac{1}{N'} \sum_{i=1}^{N} \sum_{j=1}^{N'} \rho_{\tau_i}^{k} (\delta^{\tau_i, \tau_j'})
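A minimal NumPy sketch of this loss for a single transition, with the N predicted and N' target quantile samples passed in as arrays (names are mine):

```python
import numpy as np

def iqn_loss(z_pred, z_target, tau, r, gamma, k=1.0):
    """IQN loss sketch for one transition.

    z_pred:   (N,)  samples Z_{tau_i}(x_t, a_t) at fractions tau
    z_target: (N',) samples Z_{tau'_j}(x_{t+1}, pi_beta(x_{t+1}))
    tau:      (N,)  fractions tau_i ~ U([0, 1]) used for z_pred
    """
    # pairwise TD errors delta^{tau_i, tau'_j}, shape (N, N')
    delta = r + gamma * z_target[None, :] - z_pred[:, None]
    abs_d = np.abs(delta)
    huber = np.where(abs_d <= k, 0.5 * delta ** 2, k * (abs_d - 0.5 * k))
    rho = np.abs(tau[:, None] - (delta < 0).astype(float)) * huber
    # sum over i, average over j: (1/N') sum_i sum_j rho
    return rho.sum(axis=0).mean()
```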

A sample-based estimate of the risk-sensitive policy \pi_{\beta} (via Q_{\beta}) can be computed as:
\tilde{\pi}_{\beta}(x) = \mathop{argmax}_{a \in \mathcal{A}} \frac{1}{K} \sum_{k=1}^{K} Z_{\beta(\tilde{\tau}_k)} (x, a)
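A minimal sketch of this sample-based action selection; z_fn stands for the implicit quantile network Z_{\tau}(x, a) and beta for the chosen distortion, both hypothetical helpers:

```python
import numpy as np

def risk_sensitive_action(z_fn, x, actions, beta, K=32, rng=np.random):
    """Sample-based risk-sensitive greedy action.

    z_fn(x, a, tau) -> Z_{tau}(x, a), the implicit quantile network
    beta(tau)       -> distorted fraction, e.g. CVaR(eta): beta(tau) = eta * tau
    """
    tau = rng.uniform(0.0, 1.0, size=K)
    # Q_beta(x, a) ~= (1/K) sum_k Z_{beta(tau_k)}(x, a)
    q_beta = [np.mean([z_fn(x, a, beta(t)) for t in tau]) for a in actions]
    return actions[int(np.argmax(q_beta))]
```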

Implementation details of IQN (the changes relative to DQN): the DQN convolutional torso \psi(x) is combined with a quantile embedding \phi(\tau) via an elementwise (Hadamard) product, and the usual fully-connected head f maps the result to quantile values:
Z_{\tau}(x, a) \approx f(\psi(x) \odot \phi(\tau))_a, \quad \phi_j(\tau) := \text{ReLU}\left(\sum_{i=0}^{n-1} \cos(\pi i \tau) w_{ij} + b_j\right)
The paper uses n = 64 cosine basis functions, N = N' = 8 quantile samples per update, and K = 32 samples for acting.
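A minimal NumPy sketch of the quantile embedding and the Hadamard-product combination (weights, bias and head are placeholder parameters, not the paper's):

```python
import numpy as np

def cosine_embedding(tau, weights, bias):
    """phi_j(tau) = ReLU(sum_i cos(pi * i * tau) * w_ij + b_j).

    weights: (n, d) matrix w_ij; bias: (d,) vector b_j
    """
    n = weights.shape[0]
    basis = np.cos(np.pi * np.arange(n) * tau)      # (n,)
    return np.maximum(basis @ weights + bias, 0.0)  # (d,)

def implicit_quantile(psi_x, tau, weights, bias, head):
    """Z_tau(x, .) ~= f(psi(x) * phi(tau)); `head` is the final fully-connected map."""
    phi = cosine_embedding(tau, weights, bias)
    return head(psi_x * phi)                        # elementwise product, then head
```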


4. Dopamine (code repository)

2018: Dopamine: A Research Framework for Deep Reinforcement Learning

Contains reference implementations of DQN, C51, Rainbow, and IQN.


# Model-Free RL: Path-Consistency Learning

5. PCL

2017: Bridging the Gap Between Value and Policy Based Reinforcement Learning

Establishes a connection between value-based and policy-based RL, based on the relationship between softmax temporal value consistency and policy optimality under entropy regularization; it shows that softmax-consistent action values correspond to the optimal entropy-regularized policy along any action sequence.
Policy-based methods are generally on-policy: stable to train but sample-inefficient. Value-based methods are generally off-policy: sample-efficient but unstable.
The expected discounted reward objective O_{ER}(s, \pi) can be defined recursively as:
O_{ER}(s, \pi) = \sum_a \pi (a | s) [r(s, a) + \gamma O_{ER} (s', \pi)]

Define the regularized expected reward as the sum of the expected reward and a discounted entropy term:
O_{ENT} (s, \pi) = O_{ER} (s, \pi) + \tau \mathbb{H}(s, \pi)

\mathbb{H}(s, \pi) and O_{ENT}(s, \pi) can be expressed recursively as:
\mathbb{H}(s, \pi) = \sum_a \pi(a|s) [-\log \pi(a|s) + \gamma \mathbb{H}(s', \pi)] \\ O_{ENT} (s, \pi) = \sum_a \pi(a|s) [r(s, a) - \tau \log \pi(a|s) + \gamma O_{ENT}(s', \pi)]

The key result derived from this is the path-consistency relation; for a single step it reads V^*(s) - \gamma V^*(s') = r(s, a) - \tau \log \pi^*(a|s), and along a multi-step path:
V^*(s_1) - \gamma^{t-1} V^*(s_t) = \sum_{i=1}^{t-1}\gamma^{i-1} [r(s_i, a_i) - \tau \log \pi^*(a_i|s_i)]

  • PCL algorithm (policy \pi_{\theta} with parameters \theta, value function V_{\phi} with parameters \phi); a sketch of the consistency error appears after this list:
    C(s_{i:i+d}, \theta, \phi) = -V_{\phi}(s_i) + \gamma^d V_{\phi}(s_{i+d}) + \sum_{j=0}^{d-1}\gamma^j [r(s_{i+j}, a_{i+j}) - \tau \log \pi_{\theta}(a_{i+j}|s_{i+j})] \\ O_{PCL} (\theta, \phi) = \sum_{s_{i:i+d} \in E} \frac{1}{2} C(s_{i:i+d}, \theta, \phi)^2 \\ \Delta \theta = \eta_{\pi} C(s_{i: i+d}, \theta, \phi) \sum_{j=0}^{d-1} \gamma^j \nabla_{\theta} \log \pi_{\theta} (a_{i+j} | s_{i+j}) \\ \Delta \phi= \eta_v C(s_{i: i+d}, \theta, \phi) (\nabla_{\phi} V_{\phi}(s_i) - \gamma^d \nabla_{\phi} V_{\phi}(s_{i+d}))
  • Unified PCL algorithm (a single model Q_{\rho} with parameters \rho parameterizes both the value function and the policy)
    V_{\rho} (s) = \tau \log \sum_a \exp \left\{Q_{\rho} (s, a) / \tau \right\} \\ \pi_{\rho} (a|s) = \exp \left\{ (Q_{\rho}(s, a) - V_{\rho}(s)) / \tau \right\} \\ \Delta \rho = \eta_{\pi} C(s_{i: i+d}, \rho) \sum_{j=0}^{d-1} \gamma^j \nabla_{\rho} \log \pi_{\rho} (a_{i+j} | s_{i+j}) + \eta_v C(s_{i: i+d}, \rho) (\nabla_{\rho} V_{\rho}(s_i) - \gamma^d \nabla_{\rho} V_{\rho}(s_{i+d}))
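A minimal Python sketch of the soft path-consistency error C(s_{i:i+d}, \theta, \phi); the helper names v_phi and log_pi_theta are mine and stand for V_\phi and \log \pi_\theta:

```python
def path_consistency_error(states, actions, rewards, v_phi, log_pi_theta,
                           gamma, tau):
    """C(s_{i:i+d}, theta, phi) for one sub-trajectory of length d.

    states:  [s_i, ..., s_{i+d}]  (d + 1 states)
    actions: [a_i, ..., a_{i+d-1}]
    rewards: [r_i, ..., r_{i+d-1}]
    """
    d = len(rewards)
    soft_return = sum(
        gamma ** j * (rewards[j] - tau * log_pi_theta(states[j], actions[j]))
        for j in range(d)
    )
    return -v_phi(states[0]) + gamma ** d * v_phi(states[d]) + soft_return
```

Both the policy and value updates above are driven by this single scalar error, which is what allows PCL to train off-policy from arbitrary sub-trajectories.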

6. Trust-PCL

2017: Trust-PCL: An Off-Policy Trust Region Method for Continuous Control

Combines the strengths of TRPO and PCL, using both relative entropy and entropy regularization for policy optimization. Entropy regularization helps exploration, while the relative-entropy (trust-region) term improves training stability and allows larger learning rates.

V^*(s_t) = \mathop{\mathbb{E}}_{r_{t+i}, s_{t+i}} \left[ \gamma^d V^*(s_{t+d}) + \sum_{i=0}^{d-1} \gamma^i (r_{t+i} - (\tau+\lambda) \log \pi^*(a_{t+i} | s_{t+i}) + \lambda \log \tilde{\pi} (a_{t+i} | s_{t+i})) \right] \\ C(s_{t: t+d}, \theta, \phi) = -V_{\phi}(s_t) + \gamma^d V_{\phi}(s_{t+d}) + \sum_{i=0}^{d-1} \gamma^i (r_{t+i} - (\tau+\lambda) \log \pi_{\theta}(a_{t+i} | s_{t+i}) + \lambda \log \pi_{\tilde{\theta}} (a_{t+i} | s_{t+i})) \\ \mathcal{L} (S, \theta, \phi) = \sum_{k=1}^B \sum_{t=1}^{T_k-1} C(s_{t: t+d}^{(k)}, \theta, \phi)^2
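A minimal sketch of the corresponding Trust-PCL consistency error, which differs from PCL only by the lagged-policy term \lambda \log \pi_{\tilde{\theta}} (helper names are mine):

```python
def trust_pcl_consistency_error(states, actions, rewards, v_phi,
                                log_pi, log_pi_lag, gamma, tau, lam):
    """C(s_{t:t+d}, theta, phi) with entropy weight tau and trust-region weight lam.

    log_pi     -> log pi_theta(a | s)        (current policy)
    log_pi_lag -> log pi_theta_tilde(a | s)  (lagged/previous policy)
    """
    d = len(rewards)
    soft_return = sum(
        gamma ** i * (rewards[i]
                      - (tau + lam) * log_pi(states[i], actions[i])
                      + lam * log_pi_lag(states[i], actions[i]))
        for i in range(d)
    )
    return -v_phi(states[0]) + gamma ** d * v_phi(states[d]) + soft_return
```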

The paper is quite concise and gives a good recap of both TRPO and PCL; worth revisiting.


# MARL

7. HATRPO/HAPPO

2022: Trust Region Policy Optimisation in Multi-Agent Reinforcement Learning

Applying trust-region-based policy optimization to the multi-agent setting has long been an active research topic. Earlier approaches assume homogeneous agents and implement TRPO/PPO via parameter sharing, but they lack monotonic-improvement guarantees and can converge to suboptimal policies.
This paper extends trust-region methods to cooperative MARL with heterogeneous agents. The main contributions are the multi-agent advantage decomposition lemma and the sequential policy update scheme, together with the resulting algorithms HATRPO and HAPPO built on TRPO and PPO.
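For reference, the multi-agent advantage decomposition lemma (restated here from the paper, notation may differ slightly) says that the joint advantage of an ordered subset of agents i_{1:m} decomposes into a sum of per-agent advantages, each conditioned on the actions already chosen by the preceding agents:
A_{\pi}^{i_{1:m}}(s, \mathbf{a}^{i_{1:m}}) = \sum_{j=1}^{m} A_{\pi}^{i_j}\left(s, \mathbf{a}^{i_{1:j-1}}, a^{i_j}\right)
This is what justifies the sequential policy update scheme: each agent improves its own policy conditioned on the updates of the agents before it, and the joint improvement accumulates across agents.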

