Author:David Silver
Outline
- Markov Processes
- Markov Reward Processes
- Markov Decision Processes
- Extensions to MDPs
Introduction to MDPs
- Markov decision processes formally describe
an
environment` for reinforcement learning - Where the environment is fully observable
- i.e. The current state completely
characterises
the process - Almost all RL problems can be formalised as MDPs, e.g.
- Optimal control primarily deals with
continuous
MDPs - Partially observable problems can be converted into MDPs
- Bandits are MDPs with one state
- Optimal control primarily deals with
Markov Property
“The future is independent of the past given the present”
- The state captures all
relevant
information from the history - Once the state is known, the history may be
thrown away
- i.e. The state is a
sufficient statistic
of the future
State Transition Matrix
For a Markov state s
and successor state s'
, the state transition probability is defined by
State transition matrix defines transition probabilities from all states s to all successor states s'
,
where each row of the matrix sums to 1
.
Markov Process
A Markov process is a memoryless random process, i.e. a sequence of random states with the Markov property.
Example: Student Markov Chain
Example: Student Markov Chain Episodes
Example: Student Markov Chain Transition Matrix
Markov Reward Process
A Markov reward process
is a Markov chain with values.
Example: Student MRP
Return
- The discount is the present value of future rewards
- The value of receiving reward after time-steps is .
- This values immediate reward above delayed reward.
-
close to
0
leads to”myopic”
evaluation
-
close to
-
close to
1
leads to”far-sighted”
evaluation
Why discount?
Most Markov reward and decision processes are discounted. Why?
- Mathematically convenient to discount rewards
- Avoids infinite returns in cyclic Markov processes
- Uncertainty about the future may not be fully represented
- If the reward is financial, immediate rewards may earn more interest than delayed rewards
- Animal/human behaviour shows preference for immediate reward
- It is sometimes possible to use undiscounted Markov reward processes (i.e. ),
e.g. if all sequences terminate
.
Value Function
The value function gives the long-term value of state .
Example: Student MRP Returns
Example: State-Value Function for Student MRP (1)
Example: State-Value Function for Student MRP (2)
Example: State-Value Function for Student MRP (3)
Bellman Equation for MRPs
The value function can be decomposed
into two parts:
- immediate reward
- discounted value of successor state
Bellman Equation for MRPs (2)
Example: Bellman Equation for Student MRP
Bellman Equation in Matrix Form
The Bellman equation can be expressed concisely using matrices,
where is a column vector with one entry per state
Solving the Bellman Equation
- The Bellman equation is a linear equation
- It can be solved directly:
-
Computational complexity
is for n states -
Direct solution
only possible for small MRPs - There are
many iterative methods
for large MRPs, e.g.- Dynamic programming
- Monte-Carlo evaluation
- Temporal-Difference learning
Markov Decision Process
A Markov decision process (MDP)
is a Markov reward process with decisions. It is an environment in which all states are Markov.
Example: Student MDP
Policies (1)
- A policy fully defines the behaviour of an agent
- MDP policies depend on the current state
(not the history)
- i.e. Policies are stationary
(time-independent)
,
Policies (2)
- Given an MDP and a policy
- The state sequence is a Markov process
- The state and reward sequence is a Markov reward process
- where
Value Function
Example: State-Value Function for Student MDP
Bellman Expectation Equation
The state-value function
can again be decomposed
into immediate reward plus discounted value of successor state,
The action-value function
can similarly be decomposed,
Bellman Expectation Equation for
Bellman Expectation Equation for
Bellman Expectation Equation for (2)
Bellman Expectation Equation for (2)
Example: Bellman Expectation Equation in Student MDP
Bellman Expectation Equation (Matrix Form)
The Bellman expectation equation can be expressed concisely using the induced MRP,
with direct solution
Optimal Value Function
- The optimal value function specifies the
best possible performance
in the MDP. - An MDP is
“solved”
when we know theoptimal value fn (v+q)
.
Example: Optimal Value Function for Student MDP
Example: Optimal Action-Value Function for Student MDP
Optimal Policy
Define a partial ordering over policies:
Finding an Optimal Policy
An optimal policy can be found by maximising over ,
- There is always
a deterministic optimal policy
for any MDP - If we know , we immediately
have the optimal policy
Example: Optimal Policy for Student MDP
Bellman Optimality Equation for
Bellman Optimality Equation for
Bellman Optimality Equation for (2)
Bellman Optimality Equation for (2)
Example: Bellman Optimality Equation in Student MDP
Solving the Bellman Optimality Equation
- Bellman Optimality Equation is
non-linear
- No closed form solution
(in general)
- Many iterative solution methods
- Value Iteration
- Policy Iteration
- Q-learning
- Sarsa
Extensions to MDPs (no exam)
Infinite and continuous MDPs
Partially observable MDPs
Undiscounted, average reward MDPs
Infinite MDPs (no exam)
The following extensions are all possible:
- Countably infinite state and/or action spaces
- Straightforward
- Continuous state and/or action spaces
- Closed form for linear quadratic model (LQR)
- Continuous time
- Requires partial differential equations
- Hamilton-Jacobi-Bellman (HJB) equation
- Limiting case of Bellman equation as time-step → 0
POMDPs (no exam)
Belief States (no exam)
Reductions of POMDPs (no exam)
Ergodic Markov Process (no exam)
Ergodic MDP (no exam)
Average Reward Value Function (no exam)
Questions?
The only stupid question is the one you were afraid to ask but never did.
-Rich Sutton
Reference:《UCL Course on RL》