Question 1
- Solution
Thus, although a constant $c$ is added to every reward, it does not affect the optimal policy. That is, the state values of the two models differ by a constant, but the optimal action in every state is the same. So for an infinite-horizon MDP we can generalize that
$$V'^{\pi}(s) = V^{\pi}(s) + \frac{c}{1 - \gamma} \quad \text{for every policy } \pi \text{ and state } s,$$
and hence the optimal policy is unchanged.
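As a quick numerical check of this claim, here is a minimal value-iteration sketch in Python. The small random MDP, the discount $\gamma = 0.9$, and the shift $c = 5$ are illustrative choices of mine, not part of the original problem.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, c = 4, 3, 0.9, 5.0

# Random MDP: P[a, s, s'] is the transition probability, R[s, a] the reward.
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))

def value_iteration(R, iters=2000):
    """Return the optimal values and the greedy (optimal) policy."""
    V = np.zeros(n_states)
    for _ in range(iters):
        Q = R + gamma * np.einsum('asn,n->sa', P, V)  # Q[s, a]
        V = Q.max(axis=1)
    return V, Q.argmax(axis=1)

V, pi = value_iteration(R)
V_shift, pi_shift = value_iteration(R + c)

print(np.array_equal(pi, pi_shift))               # True: same optimal policy
print(np.allclose(V_shift - V, c / (1 - gamma)))  # True: uniform offset c / (1 - gamma)
```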
Question 2
- Solution
We denote the remaining horizon by a subscript, e.g. $V_h(s)$ means the value of state $s$ with $h$ steps to go. Again, for each state value $V_h(s)$ from $h = 1$ to $H$, the difference between the two models is the same for every state but varies with $h$ (it equals $c \cdot h$ in the undiscounted case). Thus, the optimal policies under the two models are the same.
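The same check can be done for the finite-horizon case with backward induction. Again, the random MDP, the horizon $H = 5$, and the shift $c = 2$ below are illustrative assumptions of mine; the point is that the value offset equals $c \cdot h$ and the greedy policies coincide at every $h$.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, H, c = 4, 3, 5, 2.0
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))

def backward_induction(R):
    """Undiscounted finite-horizon backward induction; returns V_h and pi_h."""
    V = np.zeros(n_states)                      # V_0 = 0 (no steps left)
    values, policies = [V], []
    for h in range(1, H + 1):
        Q = R + np.einsum('asn,n->sa', P, V)    # Q_h(s, a)
        policies.append(Q.argmax(axis=1))       # greedy policy with h steps to go
        V = Q.max(axis=1)                       # V_h(s)
        values.append(V)
    return values, policies

vals, pols = backward_induction(R)
vals_c, pols_c = backward_induction(R + c)

for h in range(1, H + 1):
    assert np.array_equal(pols[h - 1], pols_c[h - 1])   # same optimal actions
    assert np.allclose(vals_c[h] - vals[h], c * h)      # offset depends only on h
```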
Question 3.1
- Solution
If the reward is -1 per step, the optimal policy chooses the shortest path to termination.
If the reward is 0 per step, the case is trivial: every path is equally good.
If the reward is +1 per step, the optimal policy chooses the longest path.
To sum up, in an indefinite-horizon MDP the optimal policy can change when a constant is added to all rewards, because trajectories of different lengths accumulate different amounts of the added constant.
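A tiny example makes this concrete. The three-state MDP below is my own illustrative construction (not from the assignment): from the start state, action 0 reaches the goal directly in one step, while action 1 takes a two-step detour; the goal is absorbing with reward 0, and every step from a non-terminal state earns the per-step reward $r$.

```python
import numpy as np

P = np.zeros((2, 3, 3))           # P[a, s, s']; states: 0 start, 1 detour, 2 goal
P[0, 0, 2] = 1.0                  # start --direct--> goal (1 step)
P[1, 0, 1] = 1.0                  # start --detour--> state 1 (2 steps total)
P[:, 1, 2] = 1.0                  # detour state -> goal (either action)
P[:, 2, 2] = 1.0                  # goal is absorbing

def start_state_action(r, iters=10):
    R = np.full((3, 2), r)
    R[2, :] = 0.0                 # no reward once the goal is reached
    V = np.zeros(3)
    for _ in range(iters):        # undiscounted; all paths terminate, so this converges
        Q = R + np.einsum('asn,n->sa', P, V)
        V = Q.max(axis=1)
    return Q[0].argmax()          # optimal action at the start state

print(start_state_action(-1.0))   # 0: shortest path
print(start_state_action( 0.0))   # 0 only by tie-breaking: every path is optimal
print(start_state_action(+1.0))   # 1: longest path
```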
Question 3.2
- Solution
Convert the indefinite-horizon MDP into a finite-horizon MDP.
Assume the maximum horizon length is $H$. For every trajectory whose length is strictly smaller than $H$, we pad it with absorbing states of reward zero so that its length becomes exactly $H$. Note that the additional absorbing states do not change the original optimal policy, because they do not change the value of any trajectory.
Now add a constant such as +1 or +2 to the rewards.
If this constant is not added to the absorbing states, the final result is the same as in Q3.1: shorter trajectories still accumulate less of the added reward than longer ones, so the optimal policy can change.
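Here is a sketch of this construction, reusing the illustrative path MDP from Q3.1: the absorbing goal state (reward 0) pads every trajectory out to a fixed horizon $H$ (here $H = 5$, an arbitrary choice), and adding +2 only to the real steps reproduces the policy change from Q3.1.

```python
import numpy as np

H = 5                             # assumed maximum episode length
P = np.zeros((2, 3, 3))           # same path MDP as in the Q3.1 sketch
P[0, 0, 2] = 1.0                  # start --direct--> goal
P[1, 0, 1] = 1.0                  # start --detour--> state 1
P[:, 1, 2] = 1.0                  # detour state -> goal
P[:, 2, 2] = 1.0                  # goal doubles as the zero-reward padding state

def start_action(step_reward):
    R = np.full((3, 2), step_reward)
    R[2, :] = 0.0                 # padding steps always give zero reward
    V = np.zeros(3)               # V_0 = 0
    for _ in range(H):            # backward induction over the fixed horizon
        Q = R + np.einsum('asn,n->sa', P, V)
        V = Q.max(axis=1)
    return Q[0].argmax()          # optimal first action at the start state

print(start_action(-1.0))         # 0: shortest path, as in the original MDP
print(start_action(-1.0 + 2.0))   # 1: the +2 shift flips the optimal policy
```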
Question 4
- Solution
- A stationary MDP can be viewed as a non-stationary MDP with the transition function $P$ and the reward function $R$ fixed along the horizon.
- We can augment the state representation by introducing the time step $t$, which means that $\tilde{s} = (s, t)$. We can then rewrite the new transition probability as $\tilde{P}\big((s', t+1) \mid (s, t), a\big) = P_t(s' \mid s, a)$ and the new reward function as $\tilde{R}\big((s, t), a\big) = R_t(s, a)$. In this way, the augmented MDP is stationary. Finally, the size of the new state space is $|\mathcal{S}| \times H$.
- At first glance, if we apply the above construction to a non-stationary MDP with an infinite horizon, the size of the new state space becomes infinite, so the conversion is no longer practical.
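Below is a small numerical check of this augmentation (a sketch under assumed random non-stationary dynamics $P_t$, $R_t$ with $|\mathcal{S}| = 3$, $|\mathcal{A}| = 2$, $H = 4$; none of these numbers come from the original problem). It verifies that value iteration on the stationary augmented MDP over states $(s, t)$ reproduces the optimal values of the non-stationary finite-horizon MDP.

```python
import numpy as np

rng = np.random.default_rng(2)
S, A, H = 3, 2, 4
P_t = rng.random((H, A, S, S))               # P_t[t, a, s, s']
P_t /= P_t.sum(axis=3, keepdims=True)
R_t = rng.random((H, S, A))                  # R_t[t, s, a]

# (1) Backward induction directly on the non-stationary MDP.
V = np.zeros(S)
for t in reversed(range(H)):
    Q = R_t[t] + np.einsum('asn,n->sa', P_t[t], V)
    V = Q.max(axis=1)
V_nonstat = V                                # optimal values at t = 0

# (2) The stationary augmented MDP over states (s, t); index = t * S + s.
nS = (H + 1) * S                             # |S| * H real layers, plus one absorbing layer at t = H for convenience
P_aug = np.zeros((A, nS, nS))
R_aug = np.zeros((nS, A))
for t in range(H):
    for s in range(S):
        R_aug[t * S + s] = R_t[t, s]
        for a in range(A):
            P_aug[a, t * S + s, (t + 1) * S:(t + 2) * S] = P_t[t, a, s]
P_aug[:, H * S:, H * S:] = np.eye(S)         # (s, H) loops on itself with reward 0

# H sweeps of the *same* (stationary) Bellman backup suffice, because the
# augmented MDP is acyclic with depth H.
V_aug = np.zeros(nS)
for _ in range(H):
    Q_aug = R_aug + np.einsum('asn,n->sa', P_aug, V_aug)
    V_aug = Q_aug.max(axis=1)

print(np.allclose(V_aug[:S], V_nonstat))     # True: the augmented MDP is equivalent
```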