We now know that the most important step in computing an optimal policy is computing the value function. But how? (The following discussion is all based on infinite-horizon problems.) Solution methods fall roughly into two categories: value iteration and policy iteration.
Value Iteration
Based on the formula for expected utility with a discount factor $\gamma$:

$$U^{\pi}(s) = E\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t)\right]$$

and the transition function $P(s' \mid s, a)$ defined in the MDP model, there is an equation that the value function intuitively satisfies:

$$U(s) = R(s) + \gamma \max_{a} \sum_{s'} P(s' \mid s, a)\, U(s')$$

which is called the Bellman equation.
Unfortunately, the Bellman equation is very difficult to solve directly, since the $\max$ operator makes it nonlinear; that is why we need the value iteration method.
Based on the Bellman equation, we can get the Bellman update:

$$U_{i+1}(s) \leftarrow R(s) + \gamma \max_{a} \sum_{s'} P(s' \mid s, a)\, U_i(s')$$

where $i$ represents the iteration time step.
The value iteration algorithm can be described as follows:
We initialize the utilities of all states to 0, then repeatedly apply the Bellman update to compute new utilities until the values converge (i.e., they all reach their unique fixed points). This saves much more time than solving the Bellman equations directly.
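The loop above can be sketched for a small tabular MDP. This is a minimal illustration, not code from the text: the `(A, S, S)` array layout for the transition function, the function name, and the tolerance are my own assumptions.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Value iteration for a finite MDP.

    P: transition tensor of shape (A, S, S), where P[a, s, s2] = P(s2 | s, a)
    R: reward vector of shape (S,)
    Returns the converged utilities U of shape (S,).
    """
    U = np.zeros(R.shape[0])                    # initialize all utilities to 0
    while True:
        # Bellman update: U'(s) = R(s) + gamma * max_a sum_s' P(s'|s,a) U(s')
        # (P @ U) has shape (A, S); max over actions gives shape (S,)
        U_new = R + gamma * (P @ U).max(axis=0)
        if np.max(np.abs(U_new - U)) < tol:     # values reached their fixed points
            return U_new
        U = U_new
```

For example, a 2-state MDP where action 1 deterministically moves to a rewarding state converges to $U = (9, 10)$ with $\gamma = 0.9$.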
Policy Iteration
Now, consider this: in the first method we update the values/utilities of each state, but in policy iteration we initialize and update policies. This is based on the observation that, to find the optimal policy, we sometimes don't need a highly accurate value function; for example, when one action is clearly better than the others. So instead of computing the values with the Bellman update, we initialize a policy $\pi_0$ and then update it. To do this, we alternate the following two main steps:
- Policy evaluation: given the current policy $\pi_i$, calculate the next-step utilities $U_{i+1}$:

$$U_{i+1}(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, \pi_i(s))\, U_i(s')$$
This update formula is similar to, but simpler than, the Bellman update: there is no need to compute the maximum over all possible actions; we simply use the action given by the policy $\pi_i$ at time step $i$.
- Policy improvement: use the calculated one-step look-ahead utilities to compute a new policy $\pi_{i+1}$.
To improve the policy, we need to choose a better policy to replace the current one. To do so, we introduce the action-value function, or Q-function, for policy $\pi$:

$$Q^{\pi}(s, a) = R(s) + \gamma \sum_{s'} P(s' \mid s, a)\, U^{\pi}(s')$$
The main difference between the Q-function and the value function is that the Q-function is the expected utility after fixing the exact action $a$. Suppose the sizes of the state space and the action space are $|S|$ and $|A|$ respectively; then for each policy $\pi$ there are $|S|$ values, one for each state, and $|S| \times |A|$ Q-values, one for each state-action pair.
The policy improvement theorem is then easy to state: suppose a new policy $\pi'$ satisfies, for all $s \in S$:

$$Q^{\pi}(s, \pi'(s)) \ge U^{\pi}(s)$$

Then for all $s \in S$:

$$U^{\pi'}(s) \ge U^{\pi}(s)$$

In this case, we say that $\pi'$ is better than $\pi$, so we can improve the policy from $\pi$ to $\pi'$.
Alternating the above two steps until the policy no longer changes gives the optimal policy; this is the policy iteration algorithm, described as follows:
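The evaluation/improvement alternation can be sketched as below, reusing the same hypothetical `(A, S, S)` transition-tensor convention as before; the number of evaluation sweeps per round is an arbitrary choice here, not something fixed by the text.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, n_eval_steps=20):
    """Policy iteration for a finite MDP.

    P: transition tensor of shape (A, S, S); R: reward vector of shape (S,).
    Returns the final policy pi (S,) and its utility estimates U (S,).
    """
    n_actions, n_states, _ = P.shape
    pi = np.zeros(n_states, dtype=int)          # initial policy pi_0: always action 0
    U = np.zeros(n_states)
    while True:
        # Policy evaluation: repeated one-step updates under the fixed policy pi.
        # P[pi, arange(S)] picks row P(. | s, pi(s)) for each state s -> shape (S, S).
        P_pi = P[pi, np.arange(n_states)]
        for _ in range(n_eval_steps):
            U = R + gamma * (P_pi @ U)
        # Policy improvement: act greedily with respect to the Q-function.
        Q = R[None, :] + gamma * (P @ U)        # Q[a, s] = R(s) + gamma * sum_s' P U
        pi_new = Q.argmax(axis=0)
        if np.array_equal(pi_new, pi):          # no change in policy -> done
            return pi, U
        pi = pi_new
```

On the same 2-state example as above, this returns the policy that always picks action 1, even though the evaluation step is only approximate.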
More Accurate Policy Iteration Methods
The policy iteration method stated above, which performs one step each of policy evaluation and policy improvement, is called generalized policy iteration; it is a very general and common method. However, there are other, more accurate methods based on more accurate policy evaluation.
- For each step of policy evaluation, instead of updating the utilities for only one time step, solve the following system of equations for the exact utilities:

$$U^{\pi}(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\, U^{\pi}(s')$$

For $|S|$ states there are $|S|$ such equations, and they can be solved in $O(|S|^3)$ time using basic linear algebra. This is more complex but much more accurate; however, it's a bit too much for almost all problems, since we usually don't need the most accurate utilities.
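Since the $\max$ is gone once the policy is fixed, the system is linear and a standard solver handles it. A sketch under the same assumed array conventions (the rearrangement $(I - \gamma P_{\pi})U = R$ follows directly from the equations above):

```python
import numpy as np

def evaluate_policy_exact(P, R, pi, gamma=0.9):
    """Exact policy evaluation by solving the linear system
    U = R + gamma * P_pi U,  i.e.  (I - gamma * P_pi) U = R."""
    n_states = R.shape[0]
    P_pi = P[pi, np.arange(n_states)]       # (S, S) matrix with rows P(. | s, pi(s))
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)
```

`np.linalg.solve` uses an LU factorization, which is the $O(|S|^3)$ cost mentioned above.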
- Another method performs $k$ iteration steps in the policy evaluation step, instead of iterating to convergence or for only one step; this is called modified policy iteration. $k$ can be chosen according to the environment and the problem.
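The evaluation step of modified policy iteration is just $k$ repetitions of the fixed-policy backup, starting from the current estimate; a small sketch with the same assumed conventions (`k` is a tuning parameter, not a value from the text):

```python
import numpy as np

def evaluate_policy_k_steps(P, R, pi, U, gamma=0.9, k=5):
    """Modified policy iteration's evaluation step: k Bellman backups
    under the fixed policy pi, starting from the current estimate U."""
    P_pi = P[pi, np.arange(R.shape[0])]     # (S, S) rows P(. | s, pi(s))
    for _ in range(k):
        U = R + gamma * (P_pi @ U)          # one backup under pi
    return U
```

Setting `k=1` recovers the one-step evaluation used earlier, while letting `k` grow approaches the exact linear-system solution.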