在大多是强化学习(reinforcement learning RL)问题中,环境的model
都是未知的,也就无法直接做动态规划。一种方法是去学MDP
,在这个系列的理解强化学习中的策略迭代和值迭代
这篇文章中有具体思路。但这种做法还是会存在很多问题,就是在sample
过程中会比较麻烦,如果你随机sample
的话就会有某些state
你很难sample
到,而按照某种策略sample
的话,又很难得到真实的转移概率。一旦你的model
出现了问题,值迭代和策略迭代都将会出现问题。
于是就有了Model-free Reinforcement Learning
,直接与环境交互,直接从数据中学到model
。
Model-free Reinforcement Learning
Model-free Reinforcement Learning
需要从数据中estimate
出value
是多少(state or state-action pair),接下来拿到cumulative reward
的期望,得到这些case
之后,再去做model-free
的control
,去optimal
当前的policy
使得value function
最大化。
那model-free
的value function
如何来做prediction
呢?
在model-free
的RL
中我们无法获取state transition
和reward function
,我们仅仅是有一些episodes
。之前我们是拿这些episodes
学model
,在model free
的方法中拿这些episode
直接学value function
或者是policy
,不需要学MDP
。这里面两个关键的key steps
:1. estimate value function. 2. optimize policy.
Value Function Estimate
In model-based RL (MDP), the value function is calculated by dynamic programming
在model free
的方法中,我们不知道state transition
,由此无法计算上述等式的期望。
Monte-Carlo Methods
Monte-Carlo methods are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results. For example, to calculate the circle's surface. As show in following figure:
对上述方框中均匀撒上一些点,然后用如下等式计算即可:
Monte-Carlo Value Estimation
我们有很多episodes
,基于这些episode
,我们去计算total discounted reward
:
Value function
的 expected return
可表示为如下数学形式:
上述方法可总结为两步:1. 使用policy
从state
开始采样 个episodes
。2. 计算平均累计奖励(the average of cumulative reward )。可以看出来,这种基于采样的方法,直接一步到位,计算value
而不需要计算MDP
中的什么状态转移啥的。
上述思想更加细致、更具体的方法可用如下形式表示:
- Sample episodes of policy 。
- Every time-step that state is visited in an episode
- Increment counter
- Increment total return
- Value is estimated by mean return
- By law of large numbers as 。
Incremental Monte-Carlo Updates
- Update incrementally after each episode
- For each state with cumulative return
- For non-stationary problems (i.e. the environment could be varying over time), it can be useful to track a running mean, i.e. forget old episodes
如果环境的state transition
和reward function
一直在变,我们把这个环境叫做non-stationary
,环境本身肯定叫做stochastic
环境。但是如果分布不变,叫做statically environment
,但是环境本身的分布会发生变化的话,就需要去忘掉一些老的episode
,如果用平均的方法去做的话,老的episode
和新的episode
一样,它就忘不掉老的episode
。
Monte-Carlo Value Estimation
的一些特点:
- MC methods learn directly from episodes of experience
- MC is model-free: no knowledge of MDP transitions / rewards
- MC learns from complete episodes: no bootstrapping (discussed later)
- MC uses the simplest possible idea: value = mean return
- Caveat: can only apply MC to episodic MDPs i.e., all episodes must terminate
Temporal-Difference Learning
TD
的方法中引入对未来值函数的估计:
TD
的算法主要有以下四个特点:
- TD methods learn directly from episodes of experience
- TD is model-free: no knowledge of MDP transitions / rewards
- TD learns from incomplete episodes, by bootstrapping
- TD updates a guess towards a guess
Monte Carlo vs. Temporal Difference
Monte Carlo
方法和Temporal Difference
方法对比如下:
The same goal: learn from episodes of experience under policy 。
-
Incremental every-visit Monte-Carlo
- Update value toward actual return 。
- Simplest temporal-difference learning algorithm: TD
- Update value toward estimated return 。
- TD Target:;
- TD error:
Advantages and Disadvantages of MC vs. TD
-
TD can learn before knowing the final outcome
- TD can learn online after every step
- MC must wait until end of episode before return is known
-
TD can learn without the final outcome
- TD can learn from incomplete sequences
- MC can only learn from complete sequences
- TD works in continuing (non-terminating) environments
- MC only works for episodic (terminating) environments
Bias/Variance Trade-Off
- Return is unbiased estimate of 。
基于当前的策略去采样,然后计算平均值,这样得到的估计是无偏估计。
- TD target is biased estimate of 。
TD target
中由于存在对未来的估计,这个估计如果是非常准确的,那TD target
也是unbiased estimate
,但是由于很难估计准确,所以是 biased estimate
。
- TD target is of much lower variance than the return
TD target的方法一般比Return 要小。Return depends on many random actions, transitions and rewards;TD target depends on one random action, transition and reward
Advantages and Disadvantages of MC vs. TD (2)
- MC has high variance, zero bias
MC
方法具有好的 convergence properties (even with function approximation) 并且 Not very sensitive to initial value 但是需要 Very simple to understand and use。需要多采样去降低variance
。
- TD has low variance, some bias
TD
的方法 Usually more efficient than MC ,TD converges to ,but not always with function approximation。并且 More sensitive to initial value than MC。
n-step model-free prediction
For time constraint, we may jump n-step prediction section and directly head to model-free control
- Define the n-step return
- n-step temporal-difference learning
有了值函数之后,我们就需要去做策略改进了。
我的微信公众号名称:深度学习与先进智能决策
微信公众号ID:MultiAgent1024
公众号介绍:主要研究分享深度学习、机器博弈、强化学习等相关内容!期待您的关注,欢迎一起学习交流进步!