Question: Why is this a bad idea?
Answer: We don't gain information at every step -> each candidate is sampled independently, so earlier samples don't inform later ones.
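To make this concrete, here is a minimal sketch of the random-shooting ("guess and check") idea being criticized: every candidate action sequence is drawn independently from the same fixed distribution, so nothing learned from earlier guesses improves later ones. The `cost_of_sequence` callable is a hypothetical stand-in for rolling out the learned model and summing the cost of one plan.

```python
import numpy as np

def random_shooting(cost_of_sequence, horizon, action_dim, n_samples=1000):
    """Random shooting: sample action sequences i.i.d. and keep the best.

    `cost_of_sequence` is a hypothetical callable that rolls out the learned
    model and returns the total cost of one (horizon, action_dim) plan.
    Every sample comes from the same fixed distribution, so no information
    gained from earlier samples is used to pick later ones.
    """
    candidates = np.random.uniform(-1.0, 1.0, size=(n_samples, horizon, action_dim))
    costs = np.array([cost_of_sequence(a) for a in candidates])
    return candidates[np.argmin(costs)]
```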
In theory, any optimization method can be used here, but for this particular model-based RL case, some are better than others.
E.g., first-order gradient descent is not a good idea.
For now -> derivative-free methods
http://web.mit.edu/6.454/www/www_fall_2003/gew/CEtutorial.pdf
??? (These look like two different things.)
here -> low variance, compared with vanilla PG -> we only care about the ranking of the samples, rather than their numerical values
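A short sketch of the cross-entropy method from the tutorial linked above, under the same assumptions (reusing the hypothetical `cost_of_sequence` from the previous sketch): each iteration refits the sampling distribution to the elite samples, and only the ranking of the costs matters, not their numerical values.

```python
import numpy as np

def cross_entropy_method(cost_of_sequence, horizon, action_dim,
                         n_samples=500, n_elites=50, n_iters=10):
    """Cross-entropy method over open-loop action sequences.

    Each iteration: sample plans from a Gaussian, rank them by cost, and
    refit the Gaussian to the elite set. Only the ranking of the samples is
    used, which is what keeps the update low-variance compared with a
    vanilla policy-gradient style update on the raw returns.
    """
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(n_iters):
        samples = mean + std * np.random.randn(n_samples, horizon, action_dim)
        costs = np.array([cost_of_sequence(a) for a in samples])
        elites = samples[np.argsort(costs)[:n_elites]]   # keep the best-ranked plans
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean
```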
MCTS -> game planning -> handles stochasticity very well
the number of time steps can be very large
search to a certain depth (say, 3 here), and then just randomly play the game to the end
the idea: if a random policy starting from that state gets a better outcome, then that state has a higher value
Question: Can we use a better policy to replace the random policy?
Answer: Yes, e.g. a policy from a NN. Actually, MCTS can be improved in many ways.
MCTS with a better rollout policy -> better estimation of the value
-> in practice, a random policy is often preferred, probably because of its simplicity
-> also, for a small problem, a random policy is not bad
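A sketch of the playout step described above, assuming a hypothetical model interface with `env.step(state, action) -> (next_state, reward, done)` and `env.sample_action(state)`: search to some depth, then estimate a leaf state's value by playing the game out with a cheap rollout policy. The rollout policy defaults to random actions but can be swapped for something better, e.g. a NN policy.

```python
def rollout_value(env, state, rollout_policy=None, max_steps=200, gamma=0.99):
    """Estimate the value of a leaf state by playing out the game from it.

    `env` is a hypothetical simulator/model with step() and sample_action().
    `rollout_policy` defaults to uniformly random actions; replacing it with
    a better policy (e.g. a neural network) gives a better value estimate at
    the cost of extra complexity, which is why random is often preferred.
    """
    total_return, discount = 0.0, 1.0
    for _ in range(max_steps):
        if rollout_policy is None:
            action = env.sample_action(state)   # random default policy
        else:
            action = rollout_policy(state)      # e.g. a NN policy
        state, reward, done = env.step(state, action)
        total_return += discount * reward
        discount *= gamma
        if done:
            break
    return total_return
```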
Question: How about continuous action spaces?
Answer: an infinite number of actions -> will discuss later
(Bayesian optimization ?)
Additional reading
- Browne, Powley, Whitehouse, Lucas, Cowling, Rohlfshagen, Tavener, Perez, Samothrakis, Colton. (2012). A Survey of Monte Carlo Tree Search Methods.
  • Survey of MCTS methods and basic summary
  (a paper from 6 years ago)
a change at the beginning of the trajectory has a larger effect on everything that follows
numerically unstable -> the Hessian matrix is ill-conditioned -> extremely sensitive to some parameters, insensitive to others
shooting method: pick all the actions, roll out the dynamics, and then backpropagate ->
for shooting methods, instead of GD, use a method similar to a 2nd-order Newton's method, without building the full Hessian
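For contrast, here is a toy sketch of the plain first-order approach the notes advise against: pick all the actions, roll out the dynamics, backpropagate the cost through time, and take a gradient step. Linear dynamics and a quadratic cost are assumed purely for illustration. The backward pass shows why the problem is ill-conditioned: the gradient w.r.t. early actions accumulates products of dynamics Jacobians, so the objective is far more sensitive to them than to late actions.

```python
import numpy as np

def shooting_gradient_descent(A, B, x0, horizon, lr=1e-2, n_iters=100):
    """Naive first-order shooting on a toy problem (assumed, not from the
    lecture): linear dynamics x_{t+1} = A x_t + B u_t and quadratic cost
    sum_t (x_t^T x_t + u_t^T u_t).

    Pick all actions, roll the dynamics forward, backpropagate the cost
    through time, and take a gradient step on the actions. Note how the
    gradient w.r.t. u_t applies A^T once per later time step, so early
    actions dominate the curvature: this is the ill-conditioning that makes
    plain gradient descent struggle here.
    """
    m = B.shape[1]
    actions = np.zeros((horizon, m))
    for _ in range(n_iters):
        # forward rollout: states[t] is x_t
        states = [x0]
        for t in range(horizon):
            states.append(A @ states[-1] + B @ actions[t])
        # backward pass: lam carries d(cost)/d(x_{t+1})
        grad_u = np.zeros_like(actions)
        lam = 2.0 * states[horizon]                # gradient of the final state cost
        for t in reversed(range(horizon)):
            grad_u[t] = 2.0 * actions[t] + B.T @ lam
            lam = 2.0 * states[t] + A.T @ lam      # chain rule through x_{t+1} = A x_t + B u_t
        actions -= lr * grad_u                     # plain gradient descent step
    return actions
```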
assume f (the dynamics) is a linear function (this is what iLQR does, linearizing around the current trajectory);
Newton's method also keeps the 2nd-order term of the dynamics
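For reference, the standard expansions behind this comparison (notation assumed here: expand around the current trajectory (\hat{x}_t, \hat{u}_t)): iLQR uses a first-order expansion of the dynamics and a second-order expansion of the cost, while Newton's method / DDP also keeps the second-order term of the dynamics.

```latex
% iLQR: linearize the dynamics, quadraticize the cost around (\hat{x}_t, \hat{u}_t)
f(x_t, u_t) \approx f(\hat{x}_t, \hat{u}_t)
  + \nabla_{x_t, u_t} f(\hat{x}_t, \hat{u}_t)
    \begin{bmatrix} x_t - \hat{x}_t \\ u_t - \hat{u}_t \end{bmatrix}
\qquad
c(x_t, u_t) \approx c(\hat{x}_t, \hat{u}_t)
  + \nabla_{x_t, u_t} c(\hat{x}_t, \hat{u}_t)
    \begin{bmatrix} x_t - \hat{x}_t \\ u_t - \hat{u}_t \end{bmatrix}
  + \tfrac{1}{2}
    \begin{bmatrix} x_t - \hat{x}_t \\ u_t - \hat{u}_t \end{bmatrix}^{\top}
    \nabla^2_{x_t, u_t} c(\hat{x}_t, \hat{u}_t)
    \begin{bmatrix} x_t - \hat{x}_t \\ u_t - \hat{u}_t \end{bmatrix}

% Newton's method / DDP: additionally keep the 2nd-order term of the dynamics
f(x_t, u_t) \approx f(\hat{x}_t, \hat{u}_t)
  + \nabla_{x_t, u_t} f(\hat{x}_t, \hat{u}_t)
    \begin{bmatrix} x_t - \hat{x}_t \\ u_t - \hat{u}_t \end{bmatrix}
  + \tfrac{1}{2}
    \begin{bmatrix} x_t - \hat{x}_t \\ u_t - \hat{u}_t \end{bmatrix}^{\top}
    \nabla^2_{x_t, u_t} f(\hat{x}_t, \hat{u}_t)
    \begin{bmatrix} x_t - \hat{x}_t \\ u_t - \hat{u}_t \end{bmatrix}
```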
Both iLQR and Newton's method converge, at the same rate
Additional reading
- Mayne, Jacobson. (1970). Differential dynamic programming.
  • Original differential dynamic programming algorithm.
- Tassa, Erez, Todorov. (2012). Synthesis and Stabilization of Complex Behaviors through Online Trajectory Optimization.
  • Practical guide for implementing non-linear iterative LQR.
- Levine, Abbeel. (2014). Learning Neural Network Policies with Guided Policy Search under Unknown Dynamics.
  • Probabilistic formulation and trust region alternative to deterministic line search.
trajectory optimization does a great job, given a good model