https://classroom.udacity.com/courses/ud501/lessons/5326212698/concepts/54629888620923
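hallucinate: to produce imagined (simulated) experience
Dyna-Q: a hybrid of model-free and model-based learning
After each interaction with the real world, the agent performs 100 further updates on its own, using hallucinated experience.
T'[s,a,s']: the probability of ending up in state s' after taking action a in state s
R'[s,a]: the reward for taking action a in state s
T is updated from the number of times each transition actually occurred in the real world.
Exercise: How to evaluate T?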
Correction: The expression should be:
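A count-based form consistent with the rule that T is updated from real-world occurrences, written as a sketch. Here $T_c[s,a,s']$ is an assumed name (not defined in these notes) for the number of times the real world moved from state s to state s' under action a:

```latex
T[s,a,s'] = \frac{T_c[s,a,s']}{\sum_{i} T_c[s,a,i]}
```

The denominator is the total number of times action a was taken in state s, so the estimates for a given (s, a) sum to 1.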
R: the reward stored in the model
r: the actual immediate reward
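To make the roles of R and r concrete, here is a minimal sketch of one model update from a single real experience tuple. It assumes a count table `Tc`, a learning rate `alpha`, and the blending rule R[s,a] = (1 - alpha) * R[s,a] + alpha * r; these names and the function names are illustrative choices, not taken from the notes.

```python
import numpy as np


def update_model(Tc, R, s, a, s_prime, r, alpha=0.2):
    """Update the learned model in place from one real tuple (s, a, s', r)."""
    Tc[s, a, s_prime] += 1                        # count the observed transition
    R[s, a] = (1 - alpha) * R[s, a] + alpha * r   # blend the real reward r into model reward R


def transition_probs(Tc, s, a):
    """Estimate T[s, a, :] by normalizing the counts for (s, a)."""
    counts = Tc[s, a]
    return counts / counts.sum()


# Toy sizes, chosen only for illustration.
num_states, num_actions = 3, 2
# A tiny initial count (an assumption) avoids dividing by zero for unseen pairs.
Tc = np.full((num_states, num_actions, num_states), 1e-5)
R = np.zeros((num_states, num_actions))

update_model(Tc, R, s=0, a=1, s_prime=2, r=1.0)
probs = transition_probs(Tc, 0, 1)
```

After this single update, almost all of the probability mass for (s=0, a=1) sits on s'=2, and the model reward R[0,1] has moved a fraction `alpha` of the way toward the observed r.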
Summary
The Dyna architecture consists of a combination of:
- direct reinforcement learning from real experience tuples gathered by acting in an environment,
- updating an internal model of the environment, and,
- using the model to simulate experiences.
Sutton and Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998. [web]
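The three parts of the Dyna architecture can be sketched in one loop. This is a minimal tabular illustration under stated assumptions, not the course's reference implementation: the environment function `chain_step`, the hyperparameter values, random tie-breaking, and the choice to hallucinate only from (s, a) pairs already seen are all hypothetical choices made for the example.

```python
import random
import numpy as np


def chain_step(s, a):
    """Hypothetical 3-state chain: action 1 moves right, action 0 moves left."""
    if a == 1 and s == 2:
        return 0, 1.0                 # stepping right from the last state pays 1 and restarts
    s2 = s + 1 if a == 1 else max(s - 1, 0)
    return s2, 0.0


def dyna_q(step, num_states, num_actions, start_state, real_steps=300,
           planning_steps=100, alpha=0.2, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = np.zeros((num_states, num_actions))
    Tc = np.full((num_states, num_actions, num_states), 1e-5)  # transition counts
    R = np.zeros((num_states, num_actions))                    # model reward
    seen, seen_set = [], set()                                 # (s, a) pairs tried so far

    def q_update(s, a, s2, r):
        Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * Q[s2].max())

    s = start_state
    for _ in range(real_steps):
        # Epsilon-greedy action, breaking ties at random.
        if rng.random() < epsilon:
            a = rng.randrange(num_actions)
        else:
            best = np.flatnonzero(Q[s] == Q[s].max())
            a = int(rng.choice(list(best)))
        s2, r = step(s, a)

        q_update(s, a, s2, r)                        # 1. direct RL from real experience
        Tc[s, a, s2] += 1                            # 2. update the internal model
        R[s, a] = (1 - alpha) * R[s, a] + alpha * r
        if (s, a) not in seen_set:
            seen_set.add((s, a))
            seen.append((s, a))

        for _ in range(planning_steps):              # 3. simulate experience with the model
            hs, ha = rng.choice(seen)
            hs2 = int(Tc[hs, ha].argmax())           # most likely next state under the model
            q_update(hs, ha, hs2, R[hs, ha])
        s = s2
    return Q


Q = dyna_q(chain_step, num_states=3, num_actions=2, start_state=0)
```

With `planning_steps=100`, each real step is followed by 100 simulated Q updates, which is what lets the value of the reward spread through the table after only a few real interactions.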
Resources
- Richard S. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning, Austin, TX, 1990. [pdf]
- Sutton and Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998. [web]
- (videos, slides)
  - Lecture 8: Integrating Learning and Planning [pdf]