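Distributional algorithms are one of the more interesting recent findings in reinforcement learning, and they can serve as a basis for reinforcement learning algorithms with more rigorous theoretical support. This series presents recent work from the Google Brain team, which gives the first theoretical proof of convergence for distributional reinforcement learning combined with function approximation.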
Distributional reinforcement learning with linear function approximation
Authors: Marc G. Bellemare, Nicolas Le Roux, Pablo Samuel Castro and Subhodeep Moitra
Institute: Google Brain
Despite many algorithmic advances, our theoretical understanding of practical distributional reinforcement learning methods remains limited. One exception is Rowland et al. (2018)'s analysis of the C51 algorithm in terms of the Cramér distance, but their results only apply to the tabular setting and ignore C51's use of a softmax to produce normalized distributions. In this paper we adapt the Cramér distance to deal with arbitrary vectors. From it we derive a new distributional algorithm which is fully Cramér-based and can be combined with linear function approximation, with formal guarantees in the context of policy evaluation. In allowing the model's prediction to be any real vector, we lose the probabilistic interpretation behind the method, but otherwise maintain the appealing properties of distributional approaches. To the best of our knowledge, ours is the first proof of convergence of a distributional algorithm combined with function approximation. Perhaps surprisingly, our results provide evidence that Cramér-based distributional methods may perform worse than directly approximating the value function.
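In summary: although there have been many algorithmic advances, the theory behind practical distributional reinforcement learning methods is still thin. The exception is the 2018 work of Rowland et al., which analyzes the C51 algorithm using the Cramér distance, but those results hold only in the tabular setting and ignore the softmax that C51 uses to produce normalized distributions. The authors adapt the Cramér distance so that it handles arbitrary vectors, and from it derive a new distributional algorithm that is based entirely on the Cramér distance and can be combined with linear function approximation, with formal guarantees for policy evaluation. Allowing the model's prediction to be any real vector gives up the probabilistic interpretation behind the method, but the other attractive properties of distributional approaches are kept. To the authors' knowledge, this is the first convergence proof for a distributional algorithm combined with function approximation. Surprisingly, the results also provide evidence that Cramér-based distributional methods may perform worse than approximating the value function directly.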
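To make the central quantity concrete, here is a minimal sketch of the squared Cramér distance between two vectors defined on a shared, evenly spaced support, computed as the squared difference of their cumulative sums. The function name `cramer_distance_sq` and the parameter `atom_spacing` are hypothetical, introduced only for illustration. The formula is the standard one for discrete distributions; applying it unchanged to vectors that are not probability distributions is an assumption made here, since this section does not state the paper's exact generalized definition.

```python
from itertools import accumulate


def cramer_distance_sq(p, q, atom_spacing=1.0):
    """Squared Cramér (l2) distance between two vectors on a shared support.

    p, q: sequences of weights over the same evenly spaced atoms.
          They are not required to be non-negative or to sum to 1
          (an assumption for illustration, see the lead-in).
    atom_spacing: distance between neighbouring atoms (hypothetical name).
    """
    if len(p) != len(q):
        raise ValueError("p and q must be defined on the same support")
    # Cumulative sums play the role of the cumulative distribution functions.
    cdf_p = list(accumulate(p))
    cdf_q = list(accumulate(q))
    # Discrete version of the integral of (F_p(x) - F_q(x))^2 over x.
    return atom_spacing * sum((a - b) ** 2 for a, b in zip(cdf_p, cdf_q))


# Two proper distributions on three atoms.
p = [0.5, 0.5, 0.0]
q = [0.0, 0.5, 0.5]
d_pq = cramer_distance_sq(p, q)

# An arbitrary real vector (negative entry, not a distribution).
v = [1.5, -0.5, 0.0]
d_vq = cramer_distance_sq(v, q)
```

Because the computation only uses cumulative sums, it is well defined for any real vector, which illustrates in the simplest way why the distance can be extended beyond normalized distributions.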