Chapter 6: Temporal-Difference Learning

Temporal-difference (TD) learning combines ideas from DP and MC. Like MC, it learns directly from sample experience without a model of the environment's dynamics. Like DP, it bootstraps: it updates value estimates partly from other learned estimates.
DP, MC, and TD all follow the GPI framework and share the same greedy policy improvement step; their main difference lies in how they estimate the value function of the policy.

TD Prediction

Both TD and constant-\alpha MC follow the same update form with different targets. For the MC update, the target is G_t, and the update rule is V(S_t) \leftarrow V(S_t) + \alpha [G_t - V(S_t)]. For the TD(0) update (TD(0) is a special case of TD(\lambda), which will be covered in later chapters), the target is R_{t+1} + \gamma V(S_{t+1}), and the update rule is V(S_t) \leftarrow V(S_t) + \alpha [R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]. Both are sample updates because they are based on a single sampled successor, rather than on the complete distribution over all possible successors as in the expected updates of DP.
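As a concrete sketch, the TD(0) update above can be written as a small Python loop. The env/policy interface below (env.reset(), env.step(action) returning (next_state, reward, done), and policy(state) returning an action) is an assumed, hypothetical one rather than any particular library's API.

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    """TD(0) prediction sketch for estimating V under a fixed policy."""
    V = defaultdict(float)  # state-value estimates, initialized to 0
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # TD(0) target: R_{t+1} + gamma * V(S_{t+1}); terminal states have value 0
            target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (target - V[state])
            state = next_state
    return V
```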
The advantages of TD over MC are mainly twofold. First, TD is fully incremental, updating at every step, whereas MC can only update after a complete episode; TD is therefore usually more efficient and also works on continuing tasks. Second, TD exploits the structure of the MDP. In the batch learning setting, MC finds the estimate that minimizes mean-squared error on the training set, while batch TD finds the estimate that would be exactly correct for the maximum-likelihood model of the MDP.
The main drawback of TD is that it bootstraps: because its targets depend on current, possibly inaccurate estimates, they are biased, which is the root of several related problems.

Sarsa: On-policy TD Control

Sarsa is an on-policy TD control method that learns from the quintuple (S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}), with the update rule Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]. Sarsa is on-policy because, in expectation, the update learns the value of a target policy that is identical to the behavior policy.
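A minimal Sarsa control sketch, reusing the same hypothetical env interface as the TD(0) example; epsilon_greedy is an assumed helper implementing the \epsilon-greedy behavior policy over a (state, action)-keyed Q table.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon):
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa(env, actions, num_episodes, alpha=0.1, gamma=1.0, epsilon=0.1):
    Q = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, actions, epsilon)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(Q, next_state, actions, epsilon)
            # Target uses the action A_{t+1} actually taken by the behavior policy
            target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
    return Q
```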

Q-learning: Off-policy TD Control

Q-learning learns a deterministic (greedy) target policy while sampling episodes with an \epsilon-greedy behavior policy based on the learned Q values. The update rule is Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)]. Interestingly, although Q-learning is off-policy, it needs no importance sampling correction as introduced for MC. The reason is that the target R_{t+1} + \gamma \max_a Q(S_{t+1}, a) depends only on the sampled transition (R_{t+1}, S_{t+1}), whose distribution is determined by the environment given (S_t, A_t) rather than by the behavior policy; no action is sampled from the behavior policy inside the target, so the importance sampling ratio is 1.
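For comparison, a Q-learning sketch under the same assumed interface; the only substantive change from Sarsa is the \max_a in the target. epsilon_greedy is the helper defined in the Sarsa sketch above.

```python
from collections import defaultdict

def q_learning(env, actions, num_episodes, alpha=0.1, gamma=1.0, epsilon=0.1):
    Q = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = epsilon_greedy(Q, state, actions, epsilon)  # behavior policy
            next_state, reward, done = env.step(action)
            # Target evaluates the greedy (target-policy) action at S_{t+1}
            best_next = max(Q[(next_state, a)] for a in actions)
            target = reward + (0.0 if done else gamma * best_next)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q
```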

Expected Sarsa

Expected Sarsa can be seen as a generalization of Q-learning: its learning target takes the expectation over the next state's actions under the target policy \pi (which equals V_\pi(S_{t+1}) when Q = q_\pi), i.e., Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma \sum_a \pi(a|S_{t+1}) Q(S_{t+1}, a) - Q(S_t, A_t)]. It reduces to Q-learning when \pi is the greedy policy with respect to Q.
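A sketch of just the Expected Sarsa target, assuming an \epsilon-greedy target policy, a non-terminal S_{t+1}, and the same (state, action)-keyed Q dictionary as above; setting epsilon=0 recovers the greedy, i.e. Q-learning, target.

```python
def expected_sarsa_target(Q, next_state, actions, reward, gamma, epsilon):
    """Expected Sarsa target: average Q(S_{t+1}, a) under an epsilon-greedy pi."""
    greedy = max(actions, key=lambda a: Q[(next_state, a)])
    expected_q = 0.0
    for a in actions:
        # pi(a|S_{t+1}) = epsilon/|A| for every action, plus (1-epsilon) for the greedy one
        pi = epsilon / len(actions) + ((1 - epsilon) if a == greedy else 0.0)
        expected_q += pi * Q[(next_state, a)]
    return reward + gamma * expected_q
```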

Maximization Bias and Double Learning

In both on-policy and off-policy learning, we use Q(A^*) = \max_a Q(a) to estimate the value q(A^*) of the apparently optimal action A^* = \arg\max_a Q(a). Because the same noisy estimates are used both to select and to evaluate the action, this introduces a positive maximization bias; double learning is a good way to avoid it. Specifically, we learn two value functions Q_1 and Q_2, select the optimal action with Q_1, i.e., A^* = \arg\max_a Q_1(a), and estimate its value with Q_2. In this way we have \mathbb{E}[Q_2(A^*)] = q(A^*), as the action selection is independent of the value estimate from Q_2.
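A double Q-learning update sketch under the same assumptions as the earlier examples: a coin flip decides which table is updated on each step, so each table is trained on roughly half the data while the other is used only for evaluation.

```python
import random

def double_q_update(Q1, Q2, state, action, reward, next_state, done,
                    actions, alpha=0.1, gamma=1.0):
    """One double Q-learning update on (state, action)-keyed tables Q1 and Q2."""
    if random.random() < 0.5:
        Q1, Q2 = Q2, Q1  # swap roles so each table gets updated half the time
    if done:
        target = reward
    else:
        # One table selects the greedy action, the other evaluates it,
        # which removes the maximization bias of using a single table for both.
        a_star = max(actions, key=lambda a: Q1[(next_state, a)])
        target = reward + gamma * Q2[(next_state, a_star)]
    Q1[(state, action)] += alpha * (target - Q1[(state, action)])
```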

Approximations in TD

Most approximations in RL come from how different methods approximate the true learning target \mathbb{E}[G_t]. TD methods carry the following intrinsic approximations.

  1. Sample update instead of expected update (even expected Sarsa has to sample S_{t+1}).
  2. Bootstrap.