Chapter 6: Temporal-Difference Learning

Temporal-difference (TD) learning combines ideas from DP and MC. Like MC, it learns directly from sample experience without a model of the environment's dynamics. Like DP, it bootstraps: it updates value estimates partly from other learned estimates.
DP, MC, and TD all follow the GPI framework and share the same greedy policy improvement step; their main difference lies in how they estimate the value function of the policy.

TD Prediction

Both TD and constant-\alpha MC follow the same update form with different targets. For the MC update, the target is G_t, and the update rule is V(S_t) \leftarrow V(S_t) + \alpha [G_t - V(S_t)]. For the TD(0) update (TD(0) is a special case of TD(\lambda), which will be covered in later chapters), the target is R_{t+1} + \gamma V(S_{t+1}), and the update rule is V(S_t) \leftarrow V(S_t) + \alpha [R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]. Both are sample updates because they are based on a single sampled successor, rather than on the complete distribution over all possible successors as in the expected updates of DP.
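As a concrete sketch, the TD(0) update above can be written as a small Python loop. The env/policy interface below (env.reset(), env.step(action) returning (next_state, reward, done), and policy(state) returning an action) is an assumed, hypothetical one rather than any particular library's API.

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    """TD(0) prediction sketch for estimating V under a fixed policy."""
    V = defaultdict(float)  # state-value estimates, initialized to 0
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # TD(0) target: R_{t+1} + gamma * V(S_{t+1}); terminal states have value 0
            target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (target - V[state])
            state = next_state
    return V
```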
The advantages of TD over MC are mainly twofold. First, TD is fully incremental, updating at every step, whereas MC can only update after a complete episode; TD is therefore usually more efficient and also works on continuing tasks. Second, TD exploits the structure of the MDP. In the batch learning setting, MC finds the estimate that minimizes mean-squared error on the training set, while batch TD finds the estimate that would be exactly correct for the maximum-likelihood model of the MDP.
The main drawback of TD is that it bootstraps: because its targets depend on current, possibly inaccurate estimates, they are biased, which is the root of several related problems.

Sarsa: On-policy TD Control

Sarsa is an on-policy TD control method that learns from the quintuple (S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}), with the update rule Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]. Sarsa is on-policy because, in expectation, the update learns the value of a target policy that is identical to the behavior policy.
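A minimal Sarsa control sketch, reusing the same hypothetical env interface as the TD(0) example; epsilon_greedy is an assumed helper implementing the \epsilon-greedy behavior policy over a (state, action)-keyed Q table.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon):
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa(env, actions, num_episodes, alpha=0.1, gamma=1.0, epsilon=0.1):
    Q = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, actions, epsilon)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(Q, next_state, actions, epsilon)
            # Target uses the action A_{t+1} actually taken by the behavior policy
            target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
    return Q
```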

Q-learning: Off-policy TD Control

Q-learning learns a deterministic (greedy) target policy while sampling episodes with an \epsilon-greedy behavior policy based on the learned Q values. The update rule is Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)]. Interestingly, although Q-learning is off-policy, it needs no importance sampling correction as introduced for MC. The reason is that the target R_{t+1} + \gamma \max_a Q(S_{t+1}, a) depends only on the sampled transition (R_{t+1}, S_{t+1}), whose distribution is determined by the environment given (S_t, A_t) rather than by the behavior policy; no action is sampled from the behavior policy inside the target, so the importance sampling ratio is 1.
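For comparison, a Q-learning sketch under the same assumed interface; the only substantive change from Sarsa is the \max_a in the target. epsilon_greedy is the helper defined in the Sarsa sketch above.

```python
from collections import defaultdict

def q_learning(env, actions, num_episodes, alpha=0.1, gamma=1.0, epsilon=0.1):
    Q = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = epsilon_greedy(Q, state, actions, epsilon)  # behavior policy
            next_state, reward, done = env.step(action)
            # Target evaluates the greedy (target-policy) action at S_{t+1}
            best_next = max(Q[(next_state, a)] for a in actions)
            target = reward + (0.0 if done else gamma * best_next)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q
```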

Expected Sarsa

Expected Sarsa can be seen as a generalization of Q-learning: its learning target takes the expectation over the next state's actions under the target policy \pi (which equals V_\pi(S_{t+1}) when Q = q_\pi), i.e., Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma \sum_a \pi(a|S_{t+1}) Q(S_{t+1}, a) - Q(S_t, A_t)]. It reduces to Q-learning when \pi is the greedy policy with respect to Q.
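A sketch of just the Expected Sarsa target, assuming an \epsilon-greedy target policy, a non-terminal S_{t+1}, and the same (state, action)-keyed Q dictionary as above; setting epsilon=0 recovers the greedy, i.e. Q-learning, target.

```python
def expected_sarsa_target(Q, next_state, actions, reward, gamma, epsilon):
    """Expected Sarsa target: average Q(S_{t+1}, a) under an epsilon-greedy pi."""
    greedy = max(actions, key=lambda a: Q[(next_state, a)])
    expected_q = 0.0
    for a in actions:
        # pi(a|S_{t+1}) = epsilon/|A| for every action, plus (1-epsilon) for the greedy one
        pi = epsilon / len(actions) + ((1 - epsilon) if a == greedy else 0.0)
        expected_q += pi * Q[(next_state, a)]
    return reward + gamma * expected_q
```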

Maximization Bias and Double Learning

In both on-policy and off-policy learning, we use Q(A^*) = \max_a Q(a) to estimate the value q(A^*) of the apparently optimal action A^* = \arg\max_a Q(a). Because the same noisy estimates are used both to select and to evaluate the action, this introduces a positive maximization bias; double learning is a good way to avoid it. Specifically, we learn two value functions Q_1 and Q_2, select the optimal action with Q_1, i.e., A^* = \arg\max_a Q_1(a), and estimate its value with Q_2. In this way we have \mathbb{E}[Q_2(A^*)] = q(A^*), as the action selection is independent of the value estimate from Q_2.
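A double Q-learning update sketch under the same assumptions as the earlier examples: a coin flip decides which table is updated on each step, so each table is trained on roughly half the data while the other is used only for evaluation.

```python
import random

def double_q_update(Q1, Q2, state, action, reward, next_state, done,
                    actions, alpha=0.1, gamma=1.0):
    """One double Q-learning update on (state, action)-keyed tables Q1 and Q2."""
    if random.random() < 0.5:
        Q1, Q2 = Q2, Q1  # swap roles so each table gets updated half the time
    if done:
        target = reward
    else:
        # One table selects the greedy action, the other evaluates it,
        # which removes the maximization bias of using a single table for both.
        a_star = max(actions, key=lambda a: Q1[(next_state, a)])
        target = reward + gamma * Q2[(next_state, a_star)]
    Q1[(state, action)] += alpha * (target - Q1[(state, action)])
```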

Approximations in TD

Most approximations in RL come from how different methods approximate the true learning target \mathbb{E}[G_t]. TD methods carry the following intrinsic approximations.

  1. Sample update instead of expected update (even expected Sarsa has to sample S_{t+1}).
  2. Bootstrap.