[5-Minute Paper] Reinforcement Learning with Deep Energy-Based Policies

Paper title: Reinforcement Learning with Deep Energy-Based Policies

Title and author information

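What problem does it solve?

The authors propose an energy-based reinforcement learning algorithm for problems with continuous state and action spaces, which they call Soft Q-Learning. Its main benefits are improved robustness and the ability to transfer skills between tasks.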

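Background

Earlier methods used a stochastic policy mainly to add some exploration, for example by injecting noise or by initializing training with a high-entropy policy. Sometimes, however, we actually want to learn stochastic behaviors, since they tend to be more robust (see the further reading at the end of this post).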

Can such a stochastic policy be an optimal policy? Once we consider the connection between optimal control and probabilistic inference, a stochastic policy turns out to be the optimal answer (Todorov, 2008).

Reference: Todorov, E. General duality between optimal control and estimation. In IEEE Conf. on Decision and Control, pp. 4286–4292. IEEE, 2008.
Reference: Toussaint, M. Robot trajectory optimization using approximate inference. In Int. Conf. on Machine Learning, pp. 1049–1056. ACM, 2009.

Intuitively, framing control as inference produces policies that aim not just at the single deterministic lowest-cost behavior but at the whole range of low-cost behaviors. Instead of learning the best way to perform the task, the resulting policies try to learn all of the ways of performing the task. In other words, we want to find all of the "optimal solutions" to the problem.

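Such a policy is also a good initialization for harder problems: for example, train a model that makes a robot walk forward, then use that model to initialize training for jumping or running. It is a better exploration mechanism for seeking out the best mode in a multi-modal reward landscape. And because the policy has more behaviors to choose from, it is more robust to perturbations.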

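There is earlier work on stochastic policies (see the references at the end of this post), but most of it is hard to apply to high-dimensional continuous action spaces, or it is restricted to simple (very limited) Gaussian policy distributions. Can we instead learn a policy that represents an arbitrary distribution?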

The authors propose an energy-based model (EBM) approach in which the energy function is given by the soft Q-function.

What method is used?

Maximum Entropy Reinforcement Learning

The objective of standard reinforcement learning is:

$\pi_{\mathrm{std}}^{*}=\arg \max _{\pi} \sum_{t} \mathbb{E}_{\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) \sim \rho_{\pi}}\left[r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right]$

The objective of maximum entropy RL is:

$\pi_{\mathrm{MaxEnt}}^{*}=\arg \max _{\pi} \sum_{t} \mathbb{E}_{\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) \sim \rho_{\pi}}\left[r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)+\alpha \mathcal{H}\left(\pi\left(\cdot | \mathbf{s}_{t}\right)\right)\right]$

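where $\alpha$ is the coefficient that weights the entropy term against the reward. Unlike Boltzmann exploration and PGQ, which only maximize the entropy of the policy at the current time step, the maximum entropy objective increases the entropy of the policy over the entire trajectory.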
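To make the effect of the entropy term concrete, here is a minimal sketch of the objective in the simplest possible setting, a one-step problem with discrete actions. The reward vector, the value of `alpha`, and the function names are assumptions chosen for illustration; they are not from the paper.

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete distribution (0 * log 0 treated as 0)."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log(p[nz])))

def maxent_objective(policy, rewards, alpha):
    """One-step version of the objective: E_pi[r] + alpha * H(pi)."""
    policy = np.asarray(policy, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    return float(policy @ rewards) + alpha * entropy(policy)

# Hypothetical bandit: two equally good actions and one bad action.
rewards = np.array([1.0, 1.0, 0.0])
alpha = 0.5

greedy = np.array([1.0, 0.0, 0.0])      # commits to a single optimal action
both_modes = np.array([0.5, 0.5, 0.0])  # spreads mass over both optimal actions

j_greedy = maxent_objective(greedy, rewards, alpha)
j_both = maxent_objective(both_modes, rewards, alpha)
print(j_greedy, j_both)
```

Both policies collect the same expected reward, but the one that keeps both good actions scores higher by exactly $\alpha \log 2$. With `alpha = 0` the two policies tie, which is the standard RL objective.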

Soft Value Functions and Energy-Based Models

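Conventional RL methods usually represent the policy as a unimodal distribution over actions (left panel of the figure below). To cover the whole action distribution instead, a natural idea is to exponentiate the Q-function, which yields a multimodal policy distribution.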

A multimodal Q-function

Relationship between the energy-based model and the soft Q-function:

The authors therefore use an energy-based policy of the following form:

$\pi\left(\mathbf{a}_{t} | \mathbf{s}_{t}\right) \propto \exp \left(-\mathcal{E}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right)$

where $\mathcal{E}$ is the energy function, which can be represented by a neural network.

Theorem 1. Let the soft Q-function be defined by

$\begin{array}{l} Q_{\mathrm{soft}}^{*}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)=r_{t}+ \\ \mathbb{E}_{\left(\mathbf{s}_{t+1}, \ldots\right) \sim \rho_{\pi}}\left[\sum_{l=1}^{\infty} \gamma^{l}\left(r_{t+l}+\alpha \mathcal{H}\left(\pi_{\mathrm{MaxEnt}}^{*}\left(\cdot | \mathbf{s}_{t+l}\right)\right)\right)\right] \end{array}$

and the soft value function by

$V_{\mathrm{soft}}^{*}\left(\mathbf{s}_{t}\right)=\alpha \log \int_{\mathcal{A}} \exp \left(\frac{1}{\alpha} Q_{\mathrm{soft}}^{*}\left(\mathbf{s}_{t}, \mathbf{a}^{\prime}\right)\right) d \mathbf{a}^{\prime}$

Then the optimal policy for the maximum entropy RL objective above is:

$\pi_{\mathrm{MaxEnt}}^{*}\left(\mathbf{a}_{t} | \mathbf{s}_{t}\right)=\exp \left(\frac{1}{\alpha}\left(Q_{\mathrm{soft}}^{*}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)-V_{\mathrm{soft}}^{*}\left(\mathbf{s}_{t}\right)\right)\right)$

The policy improvement proof for Soft Q-Learning partly explains the definition above: the optimal policy must take this energy-based form.

Theorem 1 connects the maximum entropy objective to the energy-based formulation: $\frac{1}{\alpha} Q_{\mathrm{soft}}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)$ acts as the negative energy, and $\frac{1}{\alpha}V_{\mathrm{soft}}(\mathbf{s}_{t})$ serves as the log-partition function.
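The following sketch evaluates the soft value function and the energy-based policy of Theorem 1 numerically for a single state. It assumes a one-dimensional action space discretized on a grid, so the integral becomes a Riemann sum, and it uses a made-up bimodal Q-function; the function names and constants are illustrative only.

```python
import numpy as np

def soft_value(q_values, da, alpha):
    """V = alpha * log of the integral of exp(Q / alpha), via a Riemann sum.

    The maximum is subtracted before exponentiating for numerical stability.
    """
    z = q_values / alpha
    m = np.max(z)
    return alpha * (m + np.log(np.sum(np.exp(z - m)) * da))

def energy_based_policy(q_values, da, alpha):
    """Density pi(a|s) = exp((Q(s, a) - V(s)) / alpha) on the action grid."""
    v = soft_value(q_values, da, alpha)
    return np.exp((q_values - v) / alpha)

# Hypothetical bimodal Q-function for one fixed state, with peaks at a = -1.5 and a = +1.5.
actions = np.linspace(-3.0, 3.0, 601)
da = actions[1] - actions[0]
q = -np.minimum((actions - 1.5) ** 2, (actions + 1.5) ** 2)

pi = energy_based_policy(q, da, alpha=0.5)
print("integral of pi:", np.sum(pi) * da)
print("density at the two modes and at 0:",
      pi[np.argmin(np.abs(actions - 1.5))],
      pi[np.argmin(np.abs(actions + 1.5))],
      pi[np.argmin(np.abs(actions))])
```

Because $V_{\mathrm{soft}}$ is the log-partition function scaled by $\alpha$, the resulting density integrates to one without any extra normalization, and it keeps a mode at each peak of the Q-function instead of committing to one of them.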

The soft Q-function satisfies the soft Bellman equation:

$Q_{\mathrm{soft}}^{*}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)=r_{t}+\gamma \mathbb{E}_{\mathbf{s}_{t+1} \sim p_{\mathbf{s}}}\left[V_{\mathrm{soft}}^{*}\left(\mathbf{s}_{t+1}\right)\right]$

That completes the basic definitions. What remains is to adapt Q-learning to the maximum entropy policy.

Training Expressive Energy-Based Models via Soft Q-Learning

A contraction-mapping argument shows that the fixed-point iteration

$\begin{aligned} Q_{\mathrm{soft}}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) & \leftarrow r_{t}+\gamma \mathbb{E}_{\mathbf{s}_{t+1} \sim p_{\mathrm{s}}}\left[V_{\mathrm{soft}}\left(\mathbf{s}_{t+1}\right)\right], \forall \mathbf{s}_{t}, \mathbf{a}_{t} \\ V_{\mathrm{soft}}\left(\mathbf{s}_{t}\right) & \leftarrow \alpha \log \int_{\mathcal{A}} \exp \left(\frac{1}{\alpha} Q_{\mathrm{soft}}\left(\mathbf{s}_{t}, \mathbf{a}^{\prime}\right)\right) d \mathbf{a}^{\prime}, \forall \mathbf{s}_{t} \end{aligned}$

converges to $Q_{soft}^{*}$ and $V_{soft}^{*}$. Several issues remain, though: the iteration has to be extended to large state and action spaces, and sampling from the energy-based distribution is intractable.
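As a small illustration of the fixed-point iteration, the sketch below runs it on a tabular MDP. It assumes a finite action set, so the integral over actions becomes a sum, and it uses a randomly generated reward table and transition tensor; all names and sizes are hypothetical.

```python
import numpy as np

def soft_value(q, alpha):
    """V(s) = alpha * log sum_a exp(Q(s, a) / alpha) for q of shape [S, A]."""
    z = q / alpha
    m = z.max(axis=1, keepdims=True)
    return alpha * (m[:, 0] + np.log(np.exp(z - m).sum(axis=1)))

def soft_q_iteration(rewards, transitions, gamma, alpha, num_iters=500):
    """Repeat Q <- r + gamma * E_{s'}[V(s')] with V computed from the current Q.

    rewards: [S, A]; transitions: [S, A, S'] with rows summing to one.
    """
    q = np.zeros_like(rewards)
    for _ in range(num_iters):
        v = soft_value(q, alpha)
        q = rewards + gamma * transitions @ v
    return q, soft_value(q, alpha)

# Hypothetical random MDP with 3 states and 2 actions.
rng = np.random.default_rng(0)
num_states, num_actions = 3, 2
rewards = rng.uniform(0.0, 1.0, size=(num_states, num_actions))
transitions = rng.uniform(size=(num_states, num_actions, num_states))
transitions /= transitions.sum(axis=-1, keepdims=True)

gamma, alpha = 0.9, 0.5
q_star, v_star = soft_q_iteration(rewards, transitions, gamma, alpha)
policy = np.exp((q_star - v_star[:, None]) / alpha)  # energy-based policy, one row per state
print(q_star)
print(policy)
```

With a discount factor below one the backup is a contraction, so the residual of the soft Bellman equation shrinks geometrically. The tabular form does not scale to continuous spaces, which is the difficulty the next section addresses.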

Soft Q Learning

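Even though the soft Bellman backup is proven to converge, computing $V_{soft}^{*}$ involves an integral, so it is still hard to handle. The authors represent the soft Q-function with a function approximator $Q_{soft}^{\theta}(s,a)$.

First, to optimize the expressions above with stochastic optimization, we rewrite the soft value function as an expectation via importance sampling: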

$V_{\mathrm{soft}}^{\theta}\left(\mathbf{s}_{t}\right)=\alpha \log \mathbb{E}_{q_{\mathbf{a}^{\prime}}}\left[\frac{\exp \left(\frac{1}{\alpha} Q_{\mathrm{soft}}^{\theta}\left(\mathbf{s}_{t}, \mathbf{a}^{\prime}\right)\right)}{q_{\mathbf{a}^{\prime}}\left(\mathbf{a}^{\prime}\right)}\right]$

where $q_{\mathbf{a}^{\prime}}$ can be an arbitrary distribution over the action space. Soft Q-iteration can then be expressed as minimizing:

$J_{Q}(\theta)=\mathbb{E}_{\mathbf{s}_{t} \sim q_{s_{t}}, \mathbf{a}_{t} \sim q_{\mathbf{a}_{t}}}\left[\frac{1}{2}\left(\hat{Q}_{\mathrm{soft}}^{\bar{\theta}}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)-Q_{\mathrm{soft}}^{\theta}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right)^{2}\right]$

where $\hat{Q}_{\mathrm{soft}}^{\bar{\theta}}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)=r_{t}+\gamma \mathbb{E}_{\mathbf{s}_{t+1} \sim p_{\mathbf{s}}}\left[V_{\mathrm{soft}}^{\bar{\theta}}\left(\mathbf{s}_{t+1}\right)\right]$ is the target Q-value, computed with the target parameters $\bar{\theta}$ in place of $\theta$.
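The two pieces above can be sketched as follows: an importance-sampled estimate of the soft value function and the squared error against the target Q-value. This is a minimal illustration under stated assumptions, not the authors' implementation: the Q-function is a made-up quadratic, the proposal $q_{\mathbf{a}^{\prime}}$ is uniform on $[-6, 6]$, and the expectation over the next state is replaced by one sampled next state per transition.

```python
import numpy as np

def soft_value_is(q_fn, state, alpha, sample_fn, pdf_fn, num_samples, rng):
    """alpha * log E_{a'~q}[exp(Q(s, a') / alpha) / q(a')], estimated by Monte Carlo."""
    a = sample_fn(rng, num_samples)
    log_w = q_fn(state, a) / alpha - np.log(pdf_fn(a))
    m = np.max(log_w)  # subtract the max before exponentiating for stability
    return alpha * (m + np.log(np.mean(np.exp(log_w - m))))

def soft_q_loss(q_pred, reward, next_value, gamma):
    """0.5 * mean squared error between Q_theta and the target r + gamma * V_target(s')."""
    target = reward + gamma * next_value
    return 0.5 * float(np.mean((target - q_pred) ** 2))

# Hypothetical Q-function: Q(s, a) = -0.5 * (a - s)^2.
def q_fn(state, a):
    return -0.5 * (a - state) ** 2

# Hypothetical proposal distribution: uniform on [-6, 6].
def sample_fn(rng, n):
    return rng.uniform(-6.0, 6.0, size=n)

def pdf_fn(a):
    return np.full_like(a, 1.0 / 12.0)

rng = np.random.default_rng(0)
v_hat = soft_value_is(q_fn, 0.5, 1.0, sample_fn, pdf_fn, 200_000, rng)
v_exact = 0.5 * np.log(2.0 * np.pi)  # log of the Gaussian integral sqrt(2 * pi)
print(v_hat, v_exact)
```

For this Q-function with $\alpha = 1$ the integral is a Gaussian integral, so the estimate can be checked against the closed form. In practice `next_value` in `soft_q_loss` would come from `soft_value_is` evaluated with the target network's Q-function.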

Approximate Sampling and Stein Variational Gradient Descent (SVGD)

How do we sample from the policy defined by the soft Q-function? There are traditionally two strategies for sampling from an energy-based distribution: 1. use Markov chain Monte Carlo (MCMC) based sampling; 2. learn a stochastic sampling network trained to output approximate samples from the target distribution. The authors take the second route and use a sampling network based on Stein variational gradient descent (SVGD) (Liu and Wang, 2016) and amortized SVGD (Wang and Liu, 2016).

Liu, Q. and Wang, D. Stein variational gradient descent: A general purpose bayesian inference algorithm. In Advances In Neural Information Processing Systems, pp. 2370–2378, 2016.
Wang, D. and Liu, Q. Learning to draw samples: With application to amortized mle for generative adversarial learning. arXiv preprint arXiv:1611.01722, 2016.

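This has three main benefits: it provides a stochastic sample generator; the sampling network can be trained to converge to an accurate estimate of the posterior distribution of the EBM; and it connects the method to actor-critic algorithms, which later led to SAC.

We want to learn a state-conditioned stochastic neural network $\mathbf{a}_{t}=f^{\phi}\left(\xi ; \mathbf{s}_{t}\right)$, where $\phi$ denotes the network parameters and $\xi$ is noise drawn from a Gaussian or any other distribution. We look for parameters $\phi$ such that the induced action distribution $\pi^{\phi}(\mathbf{a}_{t} | \mathbf{s}_{t})$ approximates the energy-based distribution in terms of the KL divergence: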

$J_{\pi}\left(\phi ; \mathbf{s}_{t}\right)= D_{K L}\left(\pi^{\phi}\left(\cdot | \mathbf{s}_{t}\right) \| \exp \left(\frac{1}{\alpha}\left(Q_{\text {soft }}^{\theta}\left(\mathbf{s}_{t}, \cdot\right)-V_{\text {soft }}^{\theta}\right)\right)\right)$

Stein variational gradient descent gives:

$\begin{aligned} \Delta f^{\phi}\left(\cdot ; \mathbf{s}_{t}\right)= \mathbb{E}_{\mathbf{a}_{t}\sim \pi^{\phi}}\Big[&\kappa\left(\mathbf{a}_{t}, f^{\phi}\left(\cdot ; \mathbf{s}_{t}\right)\right) \nabla_{\mathbf{a}^{\prime}} Q_{\mathrm{soft}}^{\theta}\left(\mathbf{s}_{t}, \mathbf{a}^{\prime}\right)\big|_{\mathbf{a}^{\prime}=\mathbf{a}_{t}}\\ &+\alpha \nabla_{\mathbf{a}^{\prime}} \kappa\left(\mathbf{a}^{\prime}, f^{\phi}\left(\cdot ; \mathbf{s}_{t}\right)\right)\big|_{\mathbf{a}^{\prime}=\mathbf{a}_{t}}\Big] \end{aligned}$

where $\kappa$ is a kernel function and $\Delta f^{\phi}$ is the optimal direction in the reproducing kernel Hilbert space of $\kappa$. Applying the chain rule and backpropagating the Stein variational gradient into the policy network gives:

$\frac{\partial J_{\pi}\left(\phi ; \mathbf{s}_{t}\right)}{\partial \phi} \propto \mathbb{E}_{\xi}\left[\Delta f^{\phi}\left(\xi ; \mathbf{s}_{t}\right) \frac{\partial f^{\phi}\left(\xi ; \mathbf{s}_{t}\right)}{\partial \phi}\right]$
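The sketch below computes the Stein variational direction $\Delta f$ for a finite set of action particles at one state. It is a simplification under stated assumptions: it moves plain particles rather than training the amortized sampling network $f^{\phi}$, it uses an RBF kernel with a fixed bandwidth `h`, and the gradient of Q comes from a made-up quadratic Q-function.

```python
import numpy as np

def rbf_kernel(a, h):
    """RBF kernel matrix and its gradient with respect to the first argument.

    a: [N, D] particles.
    kappa[i, j] = exp(-||a_i - a_j||^2 / (2 h^2)); grad[i, j] = d kappa[i, j] / d a_i.
    """
    diff = a[:, None, :] - a[None, :, :]  # [N, N, D], entry (i, j) is a_i - a_j
    kappa = np.exp(-np.sum(diff ** 2, axis=-1) / (2.0 * h ** 2))
    grad = -diff / h ** 2 * kappa[:, :, None]
    return kappa, grad

def svgd_direction(a, grad_q, alpha, h=1.0):
    """delta[j] = mean_i [ kappa(a_i, a_j) * grad Q(a_i) + alpha * grad_{a_i} kappa(a_i, a_j) ]."""
    kappa, grad_kappa = rbf_kernel(a, h)
    g = grad_q(a)  # [N, D], gradient of Q with respect to the action at each particle
    return (kappa.T @ g + alpha * grad_kappa.sum(axis=0)) / a.shape[0]

# Hypothetical Q(s, a) = -0.5 * ||a||^2 for a fixed state, so grad_a Q = -a.
grad_q = lambda a: -a

particles = np.array([[-0.1], [0.1]])
with_entropy = svgd_direction(particles, grad_q, alpha=1.0)
without_entropy = svgd_direction(particles, grad_q, alpha=0.0)
print(with_entropy.ravel(), without_entropy.ravel())
```

The first term pulls particles toward high-Q actions and the second term, weighted by $\alpha$, pushes nearby particles apart. In this example the two particles move away from each other when `alpha = 1.0` and drift toward the single maximum when `alpha = 0.0`, which is how the kernel gradient preserves the spread of the policy.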

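Soft Q-Learning algorithm pseudocode

What results were achieved?

Experimental results verifying multi-modal behavior

Soft Q-learning takes more of the optimal strategies into account

Publication and author information

This paper was published at ICML 2017. The first author, Tuomas Haarnoja, is a research scientist at Google DeepMind.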


Reference links

Code: https://github.com/haarnoja

Further reading

Why use a stochastic policy?

In some cases we need to learn a stochastic policy. Why? The authors give two groups of reasons:

Exploration in the presence of multimodal objectives, and compositionality attained via pretraining (Daniel et al., 2012).
Robustness in uncertain environments (Ziebart, 2010), imitation learning (Ziebart et al., 2008), and improved convergence and computational properties (Gu et al., 2016a).

Reference 1: Daniel, C., Neumann, G., and Peters, J. Hierarchical relative entropy policy search. In AISTATS, pp. 273–281, 2012.
Reference 2: Ziebart, B. D. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. PhD thesis, 2010.
Reference 3: Ziebart, B. D., Maas, A. L., Bagnell, J. A., and Dey, A. K. Maximum entropy inverse reinforcement learning. In AAAI Conference on Artificial Intelligence, pp. 1433–1438, 2008.
Reference 4: Gu, S., Lillicrap, T., Ghahramani, Z., Turner, R. E., and Levine, S. Q-prop: Sample-efficient policy gradient with an off-policy critic. arXiv preprint arXiv:1611.02247, 2016a.

Prior work on maximum entropy stochastic policies

Z-learning (Todorov, 2007);

Todorov, E. Linearly-solvable Markov decision problems. In Advances in Neural Information Processing Systems, pp. 1369–1376. MIT Press, 2007.

maximum entropy inverse RL (Ziebart et al., 2008);

Ziebart, B. D., Maas, A. L., Bagnell, J. A., and Dey, A. K. Maximum entropy inverse reinforcement learning. In AAAI Conference on Artificial Intelligence, pp. 1433–1438, 2008.

approximate inference using message passing (Toussaint, 2009);

Toussaint, M. Robot trajectory optimization using approximate inference. In Int. Conf. on Machine Learning, pp. 1049–1056. ACM, 2009.

$\Psi$-learning (Rawlik et al., 2012);

Rawlik, K., Toussaint, M., and Vijayakumar, S. On stochastic optimal control and reinforcement learning by approximate inference. Proceedings of Robotics: Science and Systems VIII, 2012.

G-learning (Fox et al., 2016);

Fox, R., Pakman, A., and Tishby, N. Taming the noise in reinforcement learning via soft updates. In Conf. on Uncertainty in Artificial Intelligence, 2016.

PGQ (O'Donoghue et al., 2016), one of the recent proposals in deep RL.

O'Donoghue, B., Munos, R., Kavukcuoglu, K., and Mnih, V. PGQ: Combining policy gradient and Q-learning. arXiv preprint arXiv:1611.01626, 2016.


Last edited: 2020.03.25 10:20:33
