Temporal-Difference Learning
1. TD(0)
TD error: $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$
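A minimal tabular sketch of one TD(0) step may help here. It assumes the value table is a plain dict and that the step size `alpha` and discount `gamma` are given; the function name `td0_update` and the two-state example are illustrative and not part of these notes.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9, terminal=False):
    """Apply one tabular TD(0) update to V in place and return the TD error."""
    # Bootstrapped target; the value of a terminal state is taken to be 0.
    target = r if terminal else r + gamma * V[s_next]
    delta = target - V[s]   # TD error
    V[s] += alpha * delta   # move V(s) a fraction alpha toward the target
    return delta


# Hypothetical two-state example.
V = {"A": 0.0, "B": 1.0}
delta = td0_update(V, "A", r=0.5, s_next="B")
```

Only the entry for the visited state changes; the successor's value is read but not modified.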
2. Sarsa
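Sarsa is on-policy: the target bootstraps from the action $A_{t+1}$ that the agent actually takes next, i.e. $R_{t+1} + \gamma Q(S_{t+1}, A_{t+1})$. The sketch below assumes a tabular `Q` keyed by `(state, action)` pairs; `sarsa_update` and the state/action names are hypothetical.

```python
from collections import defaultdict


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9, terminal=False):
    """One tabular Sarsa update on Q[(s, a)]; returns the TD error."""
    # The target uses the next action chosen by the current policy.
    target = r if terminal else r + gamma * Q[(s_next, a_next)]
    delta = target - Q[(s, a)]
    Q[(s, a)] += alpha * delta
    return delta


# Hypothetical transition: (s0, left) -> reward 1.0 -> (s1, right).
Q = defaultdict(float)
Q[("s1", "right")] = 2.0
sarsa_update(Q, "s0", "left", r=1.0, s_next="s1", a_next="right")
```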
3. Q-learning
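Q-learning is off-policy: the target bootstraps from the greedy value $\max_a Q(S_{t+1}, a)$ regardless of which action is taken next. A minimal tabular sketch, assuming `Q` maps each state to a list of action values; `q_learning_update` is an illustrative name.

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9, terminal=False):
    """One tabular Q-learning update on Q[s][a]; returns the TD error."""
    # Greedy bootstrap: the maximum over actions in the next state.
    best_next = 0.0 if terminal else max(Q[s_next])
    delta = r + gamma * best_next - Q[s][a]
    Q[s][a] += alpha * delta
    return delta


# Hypothetical table with two states and two actions.
Q = {0: [0.0, 0.0], 1: [1.0, 3.0]}
q_learning_update(Q, s=0, a=0, r=1.0, s_next=1)
```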
4. Expected Sarsa
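Expected Sarsa replaces the sampled next action value with its expectation under the policy, $\sum_a \pi(a \mid S_{t+1}) Q(S_{t+1}, a)$. The sketch below assumes an $\epsilon$-greedy policy, which is only one possible choice; `epsilon_greedy_probs` and `expected_sarsa_update` are hypothetical helpers.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of an epsilon-greedy policy (ties go to the first max)."""
    n = len(q_values)
    probs = [epsilon / n] * n
    best = max(range(n), key=lambda i: q_values[i])
    probs[best] += 1.0 - epsilon
    return probs


def expected_sarsa_update(Q, s, a, r, s_next, epsilon=0.2, alpha=0.1, gamma=0.9,
                          terminal=False):
    """One tabular Expected Sarsa update on Q[s][a]; returns the TD error."""
    if terminal:
        expected_next = 0.0
    else:
        probs = epsilon_greedy_probs(Q[s_next], epsilon)
        # Expectation of the next action value under the policy.
        expected_next = sum(p * q for p, q in zip(probs, Q[s_next]))
    delta = r + gamma * expected_next - Q[s][a]
    Q[s][a] += alpha * delta
    return delta


Q = {0: [0.0, 0.0], 1: [1.0, 3.0]}
expected_sarsa_update(Q, s=0, a=0, r=1.0, s_next=1)
```

With `epsilon=0` the expectation collapses to the maximum, so the update coincides with the Q-learning update.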
5. Double Q-learning
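Maintain two Q-networks, $Q_1$ and $Q_2$, and perform one of the following two updates with equal probability:
$Q_1(S_t, A_t) \leftarrow Q_1(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma Q_2\left(S_{t+1}, \arg\max_a Q_1(S_{t+1}, a)\right) - Q_1(S_t, A_t) \right]$
$Q_2(S_t, A_t) \leftarrow Q_2(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma Q_1\left(S_{t+1}, \arg\max_a Q_2(S_{t+1}, a)\right) - Q_2(S_t, A_t) \right]$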
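The key idea in Double Q-learning is to decouple action selection from action evaluation: the estimate being updated picks the greedy next action, and the other estimate supplies its value. The following is a tabular sketch under that reading (tables instead of networks, which is a simplification); `double_q_update` and `double_q_step` are illustrative names.

```python
import random


def double_q_update(Q_sel, Q_eval, s, a, r, s_next, alpha=0.1, gamma=0.9,
                    terminal=False):
    """Update Q_sel[s][a]: Q_sel selects the greedy next action, Q_eval evaluates it."""
    if terminal:
        target = r
    else:
        a_star = max(range(len(Q_sel[s_next])), key=lambda i: Q_sel[s_next][i])
        target = r + gamma * Q_eval[s_next][a_star]
    Q_sel[s][a] += alpha * (target - Q_sel[s][a])


def double_q_step(Q1, Q2, s, a, r, s_next, rng=random, **kwargs):
    """Flip a fair coin to decide which of the two estimates gets updated."""
    if rng.random() < 0.5:
        double_q_update(Q1, Q2, s, a, r, s_next, **kwargs)
    else:
        double_q_update(Q2, Q1, s, a, r, s_next, **kwargs)


# Deterministic example: update Q1 while Q2 evaluates the selected action.
Q1 = {0: [0.0, 0.0], 1: [1.0, 3.0]}
Q2 = {0: [0.0, 0.0], 1: [5.0, 2.0]}
double_q_update(Q1, Q2, s=0, a=0, r=1.0, s_next=1)
```

Here `Q1` prefers action 1 in state 1, so the target uses `Q2[1][1] = 2.0` rather than the larger `Q2[1][0]`, which is how the method reduces maximization bias.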
6. $n$-step TD
This can be viewed as a trade-off between TD(0) and Monte Carlo (MC) methods.
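The trade-off is visible in the $n$-step return $G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})$: with $n = 1$ it is the TD(0) target, and when the rewards run to the end of the episode with a zero bootstrap it is the MC return. A small sketch, with `n_step_return` as a hypothetical helper:

```python
def n_step_return(rewards, bootstrap_value, gamma):
    """Discounted sum of n rewards plus gamma**n times the bootstrapped value.

    rewards holds R_{t+1}, ..., R_{t+n}; bootstrap_value is V(S_{t+n}).
    """
    G = bootstrap_value
    # Work backwards so each reward is discounted by the right power of gamma.
    for r in reversed(rewards):
        G = r + gamma * G
    return G


G3 = n_step_return([1.0, 1.0, 1.0], bootstrap_value=10.0, gamma=0.5)  # n = 3
G1 = n_step_return([1.0], bootstrap_value=10.0, gamma=0.5)            # TD(0) target
```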
7. $n$-step Sarsa
8. Off-policy $n$-step Sarsa
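In the off-policy case the $n$-step update is weighted by the importance sampling ratio $\rho_{t+1:t+n} = \prod_k \pi(A_k \mid S_k) / b(A_k \mid S_k)$, taken over the actions that follow $A_t$, where $\pi$ is the target policy and $b$ the behavior policy. The sketch below assumes the per-step probabilities have already been collected; both function names are illustrative.

```python
def importance_ratio(target_probs, behavior_probs):
    """Product of pi(A_k|S_k) / b(A_k|S_k) over the steps covered by the return."""
    rho = 1.0
    for pi, b in zip(target_probs, behavior_probs):
        rho *= pi / b
    return rho


def off_policy_n_step_update(q_sa, G, rho, alpha=0.1):
    """Return the new Q(S_t, A_t) after an importance-weighted n-step update."""
    return q_sa + alpha * rho * (G - q_sa)


# Hypothetical two-step example: greedy target policy, uniform behavior policy.
rho = importance_ratio(target_probs=[1.0, 0.5], behavior_probs=[0.5, 0.5])
new_q = off_policy_n_step_update(q_sa=1.0, G=3.0, rho=rho)
```

If the target policy would never have taken one of the observed actions, that step's ratio is 0 and the whole update vanishes.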
9. TD($\lambda$)
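The off-line $\lambda$-return algorithm (forward view):
$G_t^\lambda = (1 - \lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G_{t:t+n} + \lambda^{T-t-1} G_t$
$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \left[ G_t^\lambda - \hat{v}(S_t, \mathbf{w}_t) \right] \nabla \hat{v}(S_t, \mathbf{w}_t)$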
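The $\lambda$-return is a geometrically weighted average of all $n$-step returns, with the remaining weight placed on the full return once the episode ends. The sketch assumes the $n$-step returns from time $t$ have already been computed; `lambda_return` is a hypothetical helper.

```python
def lambda_return(n_step_returns, full_return, lam):
    """Combine n-step returns into the lambda-return for one time step.

    n_step_returns holds G_{t:t+1}, ..., G_{t:t+(T-t-1)}; full_return is the
    complete return G_t observed at the end of the episode.
    """
    weighted = sum(lam ** (n - 1) * G for n, G in enumerate(n_step_returns, start=1))
    tail_weight = lam ** len(n_step_returns)  # weight left over for the full return
    return (1.0 - lam) * weighted + tail_weight * full_return


G_half = lambda_return([1.0, 2.0], full_return=4.0, lam=0.5)
G_td0 = lambda_return([1.0, 2.0], full_return=4.0, lam=0.0)  # one-step return
G_mc = lambda_return([1.0, 2.0], full_return=4.0, lam=1.0)   # Monte Carlo return
```

The two limiting cases show why $\lambda$ interpolates between TD(0) and MC.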
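The semi-gradient TD($\lambda$), which introduces the eligibility trace vector $\mathbf{z}_t$ (backward view):
$\mathbf{z}_{-1} = \mathbf{0}, \quad \mathbf{z}_t = \gamma \lambda \mathbf{z}_{t-1} + \nabla \hat{v}(S_t, \mathbf{w}_t)$
$\delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t)$
$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \delta_t \mathbf{z}_t$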
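A sketch of one backward-view step, assuming a linear value function $\hat{v}(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s)$ so that the gradient is simply the feature vector. The linear form, the plain-list vectors, and the name `td_lambda_step` are assumptions made for illustration.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def td_lambda_step(w, z, x, r, x_next, alpha=0.5, gamma=0.9, lam=0.8):
    """One semi-gradient TD(lambda) step with accumulating traces.

    x and x_next are feature vectors; a terminal state is all zeros.
    Returns the new weights, the new trace, and the TD error.
    """
    # Decay the trace, then add the gradient of v_hat (the features, if linear).
    z = [gamma * lam * zi + xi for zi, xi in zip(z, x)]
    delta = r + gamma * dot(w, x_next) - dot(w, x)
    # Every weight moves in proportion to its current eligibility.
    w = [wi + alpha * delta * zi for wi, zi in zip(w, z)]
    return w, z, delta


# Hypothetical two-step episode with one-hot features.
w, z = [0.0, 0.0], [0.0, 0.0]
w, z, _ = td_lambda_step(w, z, x=[1.0, 0.0], r=1.0, x_next=[0.0, 1.0])
w, z, _ = td_lambda_step(w, z, x=[0.0, 1.0], r=2.0, x_next=[0.0, 0.0])
```

In the second step the first weight is also updated, because its trace has decayed to $\gamma \lambda$ but is still nonzero; this is how credit reaches earlier states without storing the trajectory.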
10. Sarsa($\lambda$)
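TD methods for action values, Sarsa($\lambda$) (forward view):
$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \left[ G_t^\lambda - \hat{q}(S_t, A_t, \mathbf{w}_t) \right] \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)$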
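With eligibility trace (backward view):
$\mathbf{z}_{-1} = \mathbf{0}, \quad \mathbf{z}_t = \gamma \lambda \mathbf{z}_{t-1} + \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)$
$\delta_t = R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t)$
$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \delta_t \mathbf{z}_t$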
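In the tabular special case (one-hot features per state-action pair) the trace becomes a table that is incremented for the visited pair and decayed by $\gamma \lambda$ after each step. A minimal sketch under that assumption; `sarsa_lambda_step` and the variable names are illustrative.

```python
from collections import defaultdict


def sarsa_lambda_step(Q, E, s, a, r, s_next, a_next, alpha=0.5, gamma=1.0,
                      lam=0.5, terminal=False):
    """One tabular Sarsa(lambda) step with accumulating traces; returns the TD error."""
    q_next = 0.0 if terminal else Q[(s_next, a_next)]
    delta = r + gamma * q_next - Q[(s, a)]
    E[(s, a)] += 1.0  # mark the visited pair as eligible
    for key in list(E):
        Q[key] += alpha * delta * E[key]  # credit every eligible pair
        E[key] *= gamma * lam             # then decay its trace
    return delta


# Hypothetical two-step episode: the only reward arrives on the last transition.
Q, E = defaultdict(float), defaultdict(float)
sarsa_lambda_step(Q, E, s=0, a=0, r=0.0, s_next=1, a_next=0)
sarsa_lambda_step(Q, E, s=1, a=0, r=1.0, s_next=None, a_next=None, terminal=True)
```

The first pair receives half the credit of the second because its trace has decayed once; with `lam=0` it would receive none, as in one-step Sarsa.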