Lecture 1: Introduction to Reinforcement Learning

Author: David Silver

Outline

  1. Admin
  2. About Reinforcement Learning
  3. The Reinforcement Learning Problem
  4. Inside An RL Agent
  5. Problems within Reinforcement Learning

Many Faces of Reinforcement Learning

(Figure: reinforcement learning sits at the intersection of many fields: machine learning (computer science), optimal control (engineering), the reward system (neuroscience), conditioning (psychology), bounded rationality (economics), and operations research (mathematics).)

Branches of Machine Learning

(Figure: reinforcement learning is one of the three broad branches of machine learning, alongside supervised and unsupervised learning.)

Characteristics of Reinforcement Learning

What makes reinforcement learning different from other machine learning paradigms?

  • There is no supervisor, only a reward signal
  • Feedback is delayed, not instantaneous
  • Time really matters (sequential, non-i.i.d. data)
  • Agent’s actions affect the subsequent data it receives

Examples of Reinforcement Learning

  • Fly stunt manoeuvres in a helicopter
  • Defeat the world champion at Backgammon
  • Manage an investment portfolio
  • Control a power station
  • Make a humanoid robot walk
  • Play many different Atari games better than humans

Rewards

  • A reward R_t is a scalar feedback signal
  • Indicates how well the agent is doing at step t
  • The agent’s job is to maximise cumulative reward

Reinforcement learning is based on the reward hypothesis:

Definition (Reward Hypothesis)
All goals can be described by the maximisation of expected cumulative reward.

Do you agree with this statement?
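
To make "cumulative reward" concrete, here is a minimal Python sketch; the reward values and discount factor are made up for illustration (discounting is introduced formally with the value function later in this lecture):

```python
# Cumulative (discounted) reward for an illustrative reward sequence.
rewards = [1.0, 0.0, -0.5, 2.0]  # R_{t+1}, R_{t+2}, ... (made-up values)
gamma = 0.9                      # discount factor (illustrative)

G = sum(gamma**k * r for k, r in enumerate(rewards))
print(G)  # 1.0 + 0.9*0.0 + 0.81*(-0.5) + 0.729*2.0 = 2.053
```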

Examples of Rewards

  1. Fly stunt manoeuvres in a helicopter
    • +ve reward for following desired trajectory
    • −ve reward for crashing
  2. Defeat the world champion at Backgammon
    • +/−ve reward for winning/losing a game
  3. Manage an investment portfolio
    • +ve reward for each $ in bank
  4. Control a power station
    • +ve reward for producing power
    • −ve reward for exceeding safety thresholds
  5. Make a humanoid robot walk
    • +ve reward for forward motion
    • −ve reward for falling over
  6. Play many different Atari games better than humans
    • +/−ve reward for increasing/decreasing score

Sequential Decision Making

  • Goal: select actions to maximise total future reward
  • Actions may have long term consequences
  • Reward may be delayed
  • It may be better to sacrifice immediate reward to gain more long-term reward
  • Examples:
    • A financial investment (may take months to mature)
    • Refuelling a helicopter (might prevent a crash in several hours)
    • Blocking opponent moves (might help winning chances many moves from now)

Agent and Environment

  • At each step t the agent:
    • Executes action A_t
    • Receives observation O_t
    • Receives scalar reward R_t
  • The environment:
    • Receives action A_t
    • Emits observation O_{t+1}
    • Emits scalar reward R_{t+1}

History and State

  • The history is the sequence of observations, actions, rewards:
    H_t=O_1,R_1,A_1,...,A_{t-1},O_t,R_t
  • i.e. all observable variables up to time t
  • i.e. the sensorimotor stream of a robot or embodied agent
  • What happens next depends on the history:
    • The agent selects actions
    • The environment selects observations/rewards
  • State is the information used to determine what happens next
  • Formally, state is a function of the history:
    S_t=f(H_t)
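
A minimal sketch of S_t = f(H_t); the two choices of f below (last observation only, or the complete history) are illustrative, not prescribed by the lecture:

```python
# State is a function of history: S_t = f(H_t).
# History here is a list of (observation, action, reward) tuples.
history = [("o1", "a1", 0.0), ("o2", "a2", 1.0), ("o3", "a3", 0.5)]

def last_observation(h):
    """f(H_t) = the most recent observation only."""
    return h[-1][0]

def complete_history(h):
    """f(H_t) = the entire history (always sufficient, but grows with t)."""
    return tuple(h)

print(last_observation(history))  # 'o3'
```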

Environment State

  • The environment state S_t^e is the environment's private representation
  • i.e. whatever data the environment uses to pick the next observation/reward
  • The environment state is not usually visible to the agent
  • Even if S_t^e is visible, it may contain irrelevant information

Agent State

  • The agent state S_t^a is the agent's internal representation
  • i.e. whatever information the agent uses to pick the next action
  • It can be any function of history: S_t^a = f(H_t)

Information State

An information state (a.k.a. Markov state) contains all useful information from the history.

A state S_t is Markov if and only if
    P[S_{t+1}|S_t]=P[S_{t+1}|S_1,...,S_t]

  • “The future is independent of the past given the present”
    H_{1:t}\longrightarrow S_t\longrightarrow H_{t+1:\infty}
  • Once the state is known, the history may be thrown away
  • i.e. The state is a sufficient statistic of the future
  • The environment state S_t^e is Markov
  • The history H_t is Markov

Rat Example

(Figure: a rat observes a sequence of lights, bells, and levers, followed by either cheese or an electric shock. What it predicts next depends on its choice of state: the last 3 items in the sequence, counts of each item, or the complete sequence.)

Fully Observable Environments

  • Full observability: agent directly observes environment state:
    O_t = S_t^a = S_t^e
  • Agent state = environment state = information state
  • Formally this is a Markov decision process (MDP)

Partially Observable Environments

  • Partial observability: agent indirectly observes environment:
    • A robot with camera vision isn’t told its absolute location
    • A trading agent only observes current prices
    • A poker playing agent only observes public cards
  • Now agent state \neq environment state
  • Formally this is a partially observable Markov decision process (POMDP)
  • Agent must construct its own state representation S_t^a, e.g.
    • Complete history: S_t^a = H_t
    • Beliefs of environment state: S_t^a = (P[S_t^e = s^1],...,P[S_t^e = s^n])
    • Recurrent neural network: S_t^a = \sigma(S_{t-1}^aW_s + O_tW_o)
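
A minimal numpy sketch of the recurrent update S_t^a = \sigma(S_{t-1}^aW_s + O_tW_o) above; the dimensions, random weights, and stand-in observations are all illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

state_dim, obs_dim = 4, 3
rng = np.random.default_rng(0)
W_s = rng.normal(size=(state_dim, state_dim))  # recurrent weights (illustrative)
W_o = rng.normal(size=(obs_dim, state_dim))    # observation weights (illustrative)

s = np.zeros(state_dim)             # S_0^a
for _ in range(5):                  # one update per time step
    o = rng.normal(size=obs_dim)    # O_t (stand-in for a real observation)
    s = sigmoid(s @ W_s + o @ W_o)  # S_t^a = sigma(S_{t-1}^a W_s + O_t W_o)
print(s)
```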

Major Components of an RL Agent

  • An RL agent may include one or more of these components:
    • Policy: agent’s behaviour function
    • Value function: how good is each state and/or action
    • Model: agent’s representation of the environment

Policy

  • A policy is the agent’s behaviour
  • It is a map from state to action, e.g.
    • Deterministic policy: a = \pi(s)
    • Stochastic policy: \pi(a|s) = P[A_t = a|S_t = s]
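
A minimal sketch of the two kinds of policy; the tiny state and action sets are made up for illustration:

```python
import random

# Deterministic policy: a = pi(s), here just a lookup table.
pi_det = {"s1": "left", "s2": "right"}

# Stochastic policy: pi(a|s) = P[A_t = a | S_t = s].
pi_stoch = {"s1": {"left": 0.8, "right": 0.2},
            "s2": {"left": 0.1, "right": 0.9}}

def act(s):
    """Sample an action from the stochastic policy in state s."""
    actions, probs = zip(*pi_stoch[s].items())
    return random.choices(actions, weights=probs)[0]

print(pi_det["s1"], act("s1"))
```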

Value Function

  • A value function is a prediction of future reward
  • Used to evaluate the goodness/badness of states
  • And therefore to select between actions, e.g.
    v_{\pi}(s)=E_{\pi}[R_{t+1}+\gamma R_{t+2}+\gamma^2R_{t+3}+...|S_t=s]
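
A minimal Monte Carlo sketch of the expectation above: average the discounted return over sampled episodes. The `sample_episode` argument is a hypothetical stand-in for interacting with the environment under \pi starting from state s; it should return the reward sequence R_{t+1}, R_{t+2}, ...:

```python
def discounted_return(rewards, gamma=0.9):
    """G = R_{t+1} + gamma R_{t+2} + gamma^2 R_{t+3} + ..."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

def mc_value(sample_episode, s, n_episodes=1000, gamma=0.9):
    """Monte Carlo estimate of v_pi(s): the mean discounted return
    over episodes that start in state s and follow pi."""
    returns = [discounted_return(sample_episode(s), gamma)
               for _ in range(n_episodes)]
    return sum(returns) / len(returns)
```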

Model

  • A model predicts what the environment will do next
  • P predicts the next state
  • R predicts the next (immediate) reward, e.g.
    P_{ss'}^a=P[S_{t+1}=s'|S_t=s,A_t=a]
    R_s^a=E[R_{t+1}|S_t=s,A_t=a]
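
A minimal tabular sketch of the two parts of a model defined above; the transition probabilities and reward are made-up numbers:

```python
# Tabular model: P predicts the next state, R the next immediate reward.
# All numbers below are illustrative only.
P = {("s1", "a"): {"s1": 0.3, "s2": 0.7}}  # P_{ss'}^a = P[S_{t+1}=s'|S_t=s,A_t=a]
R = {("s1", "a"): 1.0}                     # R_s^a = E[R_{t+1}|S_t=s,A_t=a]

# The model answers "what happens next?" without touching the real environment:
next_state_dist = P[("s1", "a")]
expected_reward = R[("s1", "a")]
print(next_state_dist, expected_reward)
```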

Maze Example

(Figure: a maze with start and goal. Rewards: −1 per time-step. Actions: N, E, S, W. States: the agent's location.)

Maze Example: Policy

(Figure: arrows represent the policy \pi(s) for each state s.)

Maze Example: Value Function

(Figure: numbers represent the value v_{\pi}(s) of each state s.)

Maze Example: Model

(Figure: the agent's internal model of the maze: the grid layout represents the transition model P_{ss'}^a, numbers represent the immediate reward R_s^a from each state; the model may be imperfect.)

Categorizing RL agents (1)

  1. Value Based
    • No Policy (implicit)
    • Value Function
  2. Policy Based
    • Policy
    • No Value Function
  3. Actor Critic
    • Policy
    • Value Function

Categorizing RL agents (2)

  1. Model Free
    • Policy and/or Value Function
    • No Model
  2. Model Based
    • Policy and/or Value Function
    • Model

RL Agent Taxonomy

(Figure: a Venn diagram relating model-free and model-based agents to value-based, policy-based, and actor-critic agents.)

Learning and Planning

Two fundamental problems in sequential decision making

  1. Reinforcement Learning:

    • The environment is initially unknown
    • The agent interacts with the environment
    • The agent improves its policy
  2. Planning:

    • A model of the environment is known
    • The agent performs computations with its model (without any external interaction)
    • The agent improves its policy
    • a.k.a. deliberation, reasoning, introspection, pondering, thought, search

Atari Example: Reinforcement Learning

(Figure: the rules of the game are unknown; the agent learns directly from interactive game-play, picking actions on the joystick and seeing pixels and scores.)

Atari Example: Planning

(Figure: the rules of the game are known; the agent can query an emulator as a perfect model, planning ahead, e.g. by tree search, without external interaction.)

Exploration and Exploitation (1)

  • Reinforcement learning is like trial-and-error learning
  • The agent should discover a good policy
  • From its experiences of the environment
  • Without losing too much reward along the way

Exploration and Exploitation (2)

  • Exploration finds more information about the environment
  • Exploitation exploits known information to maximise reward
  • It is usually important to explore as well as exploit

Examples

  • Restaurant Selection
    Exploitation: Go to your favourite restaurant
    Exploration: Try a new restaurant

  • Online Banner Advertisements
    Exploitation: Show the most successful advert
    Exploration: Show a different advert

  • Oil Drilling
    Exploitation: Drill at the best known location
    Exploration: Drill at a new location

  • Game Playing
    Exploitation: Play the move you believe is best
    Exploration: Play an experimental move
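
One standard way to balance the two, consistent with the examples above, is ε-greedy action selection; a minimal sketch, with made-up value estimates Q:

```python
import random

def epsilon_greedy(Q, epsilon=0.1):
    """With probability 1 - epsilon exploit the action with the highest
    estimated value; with probability epsilon explore a random action."""
    if random.random() < epsilon:
        return random.choice(list(Q))  # explore
    return max(Q, key=Q.get)           # exploit

Q = {"favourite restaurant": 0.9, "new restaurant": 0.5}  # illustrative values
print(epsilon_greedy(Q))
```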

Prediction and Control

  • Prediction: evaluate the future
    • Given a policy
  • Control: optimise the future
    • Find the best policy

Gridworld Example: Prediction

(Figure: a small gridworld; the values v_{\pi}(s) of each state under a uniform random policy.)

Gridworld Example: Control

(Figure: the same gridworld; the optimal values v_*(s) and an optimal policy \pi_*.)

Course Outline

  • Part I: Elementary Reinforcement Learning
    1. Introduction to RL
    2. Markov Decision Processes
    3. Planning by Dynamic Programming
    4. Model-Free Prediction
    5. Model-Free Control
  • Part II: Reinforcement Learning in Practice
    1. Value Function Approximation
    2. Policy Gradient Methods
    3. Integrating Learning and Planning
    4. Exploration and Exploitation
    5. Case study - RL in games

Reference: David Silver, UCL Course on RL, Lecture 1.
