Reinforcement Learning Notes: Classic Methods - TD Learning

Temporal-Difference Learning

1. TD(0)

V(S_t) \leftarrow V(S_t) + \alpha [R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]

TD error \delta_t:
\delta_t \doteq R_{t+1} + \gamma V(S_{t+1}) - V(S_t)
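A minimal tabular sketch of this update; the step size alpha, discount gamma, and integer state indices are illustrative assumptions:

```python
import numpy as np

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99, done=False):
    """One tabular TD(0) update; V is a 1-D array of state-value estimates."""
    target = r + (0.0 if done else gamma * V[s_next])  # R_{t+1} + gamma * V(S_{t+1})
    delta = target - V[s]                              # TD error delta_t
    V[s] += alpha * delta
    return delta

# Usage sketch: V = np.zeros(n_states); call td0_update after every environment step.
```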


2. Sarsa

Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]
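A tabular sketch of one Sarsa step, assuming Q is a NumPy [state, action] array and the next action A_{t+1} has already been selected (e.g., epsilon-greedily):

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99, done=False):
    """On-policy update: bootstrap from the action actually taken in S_{t+1}."""
    target = r + (0.0 if done else gamma * Q[s_next, a_next])
    Q[s, a] += alpha * (target - Q[s, a])
```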


3. Q-learning

Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)]
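A corresponding sketch for Q-learning; the only change from Sarsa is bootstrapping on the greedy action (same illustrative [state, action] array assumption):

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99, done=False):
    """Off-policy update: bootstrap from max_a Q(S_{t+1}, a), regardless of the behaviour action."""
    target = r + (0.0 if done else gamma * Q[s_next].max())
    Q[s, a] += alpha * (target - Q[s, a])
```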


4. Expected Sarsa

\begin{eqnarray} Q(S_t, A_t) &\leftarrow& Q(S_t, A_t) + \alpha [R_{t+1} + \gamma \mathbb{E}_\pi[Q(S_{t+1}, A_{t+1}) | S_{t+1}] - Q(S_t, A_t)] \\ &=& Q(S_t, A_t) + \alpha [R_{t+1} + \gamma \sum_a \pi(a|S_{t+1})Q(S_{t+1}, a) - Q(S_t, A_t)] \end{eqnarray}
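A sketch of the expected target, assuming for illustration that the target policy pi is epsilon-greedy with respect to Q:

```python
import numpy as np

def expected_sarsa_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99,
                          epsilon=0.1, done=False):
    """Replace the sampled Q(S_{t+1}, A_{t+1}) with its expectation under pi."""
    n_actions = Q.shape[1]
    pi = np.full(n_actions, epsilon / n_actions)   # exploration mass spread uniformly
    pi[np.argmax(Q[s_next])] += 1.0 - epsilon      # remaining mass on the greedy action
    expected_q = float(pi @ Q[s_next])             # sum_a pi(a|S_{t+1}) Q(S_{t+1}, a)
    target = r + (0.0 if done else gamma * expected_q)
    Q[s, a] += alpha * (target - Q[s, a])
```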


5. Double Q-learning

Maintain two action-value estimates (two Q tables or networks), Q_1 and Q_2, and with equal probability perform one of the following two updates:
Q_1(S_t, A_t) \leftarrow Q_1(S_t, A_t) + \alpha [R_{t+1} + \gamma Q_2(S_{t+1}, \mathop{\arg\max}_a Q_1(S_{t+1}, a)) - Q_1(S_t, A_t)] \\ Q_2(S_t, A_t) \leftarrow Q_2(S_t, A_t) + \alpha [R_{t+1} + \gamma Q_1(S_{t+1}, \mathop{\arg\max}_a Q_2(S_{t+1}, a)) - Q_2(S_t, A_t)]
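A sketch of one tabular double Q-learning step; the coin flip source, array layout, and hyperparameters are illustrative assumptions:

```python
import numpy as np

def double_q_update(Q1, Q2, s, a, r, s_next, alpha=0.1, gamma=0.99,
                    done=False, rng=np.random.default_rng()):
    """With probability 1/2 update Q1 (select with Q1, evaluate with Q2), else the mirror update."""
    if rng.random() < 0.5:
        Q1, Q2 = Q2, Q1                              # swap roles; the arrays are updated in place either way
    a_star = int(np.argmax(Q1[s_next]))              # argmax under the table being updated
    target = r + (0.0 if done else gamma * Q2[s_next, a_star])
    Q1[s, a] += alpha * (target - Q1[s, a])
```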


6. \textit{n}-step TD

n-step TD can be viewed as a trade-off between TD(0) and Monte Carlo methods: with n = 1 it reduces to TD(0), and as n grows to the episode length it approaches a Monte Carlo update.
G_{t: t+n} \doteq R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n V_{t+n-1}(S_{t+n}), n \geq 1, 0 \leq t < T - n \\ V_{t+n}(S_t) \doteq V_{t+n-1}(S_t) + \alpha [G_{t: t+n} - V_{t+n-1}(S_t)], 0 \leq t < T
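A sketch of the n-step return and the resulting value update; the reward buffer and bootstrap flag are illustrative assumptions:

```python
def n_step_return(rewards, V, s_boot, gamma=0.99, bootstrap=True):
    """G_{t:t+n}: n discounted rewards, plus gamma^n * V(S_{t+n}) if the episode has not ended."""
    G = sum((gamma ** k) * r for k, r in enumerate(rewards))   # rewards = [R_{t+1}, ..., R_{t+n}]
    if bootstrap:
        G += (gamma ** len(rewards)) * V[s_boot]
    return G

def n_step_td_update(V, s, G, alpha=0.1):
    V[s] += alpha * (G - V[s])
```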


7. \textit{n}-step Sarsa

G_{t: t+n} \doteq R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n Q_{t+n-1}(S_{t+n}, A_{t+n}), n \geq 1, 0 \leq t < T - n \\ Q_{t+n}(S_t, A_t) \doteq Q_{t+n-1}(S_t, A_t) + \alpha [G_{t: t+n} - Q_{t+n-1}(S_t, A_t)], 0 \leq t < T
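The same idea for action values; a sketch assuming Q is a [state, action] array and the agent has already chosen A_{t+n}:

```python
def n_step_sarsa_return(rewards, Q, s_boot, a_boot, gamma=0.99, bootstrap=True):
    """G_{t:t+n} bootstrapping from Q(S_{t+n}, A_{t+n}) when the episode has not ended."""
    G = sum((gamma ** k) * r for k, r in enumerate(rewards))
    if bootstrap:
        G += (gamma ** len(rewards)) * Q[s_boot, a_boot]
    return G

def n_step_sarsa_update(Q, s, a, G, alpha=0.1):
    Q[s, a] += alpha * (G - Q[s, a])
```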


8. Off-policy \textit{n}-step Sarsa

G_{t: t+n} \doteq R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n Q_{t+n-1}(S_{t+n}, A_{t+n}), n \geq 1, 0 \leq t < T - n \\ \rho_{t: h} \doteq \prod_{k=t}^{\min(h, T-1)} \frac{\pi(A_k | S_k)}{b(A_k | S_k)} \\ Q_{t+n}(S_t, A_t) \doteq Q_{t+n-1}(S_t, A_t) + \alpha \rho_{t+1: t+n} [G_{t: t+n} - Q_{t+n-1}(S_t, A_t)], 0 \leq t < T
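A sketch of the importance-sampling correction, assuming pi and b are given as [state, action] probability tables:

```python
def importance_ratio(states, actions, pi, b):
    """rho over the given steps: product of pi(A_k | S_k) / b(A_k | S_k)."""
    rho = 1.0
    for s, a in zip(states, actions):
        rho *= pi[s, a] / b[s, a]
    return rho

def off_policy_n_step_sarsa_update(Q, s, a, G, rho, alpha=0.1):
    """Same as the on-policy update, but the error is weighted by rho_{t+1:t+n}."""
    Q[s, a] += alpha * rho * (G - Q[s, a])
```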


9. TD(\lambda)

The off-line \lambda-return algorithm (forward view):
G_{t: t+n} \doteq R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n \hat{v}(S_{t+n}, w_{t+n-1}), n \geq 1, 0 \leq t < T - n \\ G_t^{\lambda} \doteq (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_{t: t+n} \\ G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G_{t: t+n} + \lambda^{T-t-1} G_t \\ w_{t+1} \doteq w_t + \alpha [G_t^{\lambda} - \hat{v}(S_t, w_t)] \nabla\hat{v} (S_t, w_t), t = 0, \dots, T-1

Semi-gradient TD(\lambda), which introduces an eligibility trace vector z_t (backward view):
z_{-1} \doteq 0 \\ z_t \doteq \gamma \lambda z_{t-1} + \nabla \hat{v} (S_t, w_t), 0 \leq t < T \\ \delta_t \doteq R_{t+1} + \gamma \hat{v}(S_{t+1}, w_t) - \hat{v}(S_t, w_t) \\ w_{t+1} \doteq w_t + \alpha \delta_t z_t
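A sketch of one backward-view step with linear function approximation, where \hat{v}(s, w) = w \cdot x(s), so the gradient is just the feature vector x(s); the feature representation and hyperparameters are illustrative assumptions:

```python
import numpy as np

def td_lambda_step(w, z, x, x_next, r, alpha=0.01, gamma=0.99, lam=0.9, done=False):
    """Semi-gradient TD(lambda): accumulate the trace, then move w along delta_t * z_t."""
    v = float(w @ x)
    v_next = 0.0 if done else float(w @ x_next)
    delta = r + gamma * v_next - v        # TD error delta_t
    z[:] = gamma * lam * z + x            # z_t = gamma*lambda*z_{t-1} + grad v_hat (= x for linear v_hat)
    w += alpha * delta * z
    return delta

# At the start of each episode reset the trace: z = np.zeros_like(w)
```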


10. Sarsa(\lambda)

The analogous TD method for action values, Sarsa(\lambda) (forward view):
G_{t: t+n} \doteq R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n \hat{q}(S_{t+n}, A_{t+n}, w_{t+n-1}), n \geq 1, 0 \leq t < T - n \\ G_t^{\lambda} \doteq (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_{t: t+n} \\ G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G_{t: t+n} + \lambda^{T-t-1} G_t \\ w_{t+1} \doteq w_t + \alpha [G_t^{\lambda} - \hat{q}(S_t, A_t, w_t)] \nabla \hat{q} (S_t, A_t, w_t), t = 0, \dots, T-1

With an eligibility trace (backward view):
z_{-1} \doteq 0 \\ z_t \doteq \gamma \lambda z_{t-1} + \nabla \hat{q} (S_t, A_t, w_t), 0 \leq t < T \\ \delta_t \doteq R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, w_t) - \hat{q}(S_t, A_t, w_t) \\ w_{t+1} \doteq w_t + \alpha \delta_t z_t
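The action-value version follows the same pattern; a sketch with linear \hat{q}(s, a, w) = w \cdot x(s, a), where x(s, a) is an assumed state-action feature vector:

```python
import numpy as np

def sarsa_lambda_step(w, z, x_sa, x_next_sa, r, alpha=0.01, gamma=0.99, lam=0.9, done=False):
    """Backward-view Sarsa(lambda): trace over state-action features, TD error on q_hat."""
    q = float(w @ x_sa)
    q_next = 0.0 if done else float(w @ x_next_sa)
    delta = r + gamma * q_next - q
    z[:] = gamma * lam * z + x_sa         # grad of linear q_hat is x(s, a)
    w += alpha * delta * z
    return delta
```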

