Lecture 1: Introduction to Reinforcement Learning

Author: David Silver

Outline

  1. Admin
  2. About Reinforcement Learning
  3. The Reinforcement Learning Problem
  4. Inside An RL Agent
  5. Problems within Reinforcement Learning

Many Faces of Reinforcement Learning

(Figure: reinforcement learning sits at the intersection of many fields: machine learning (computer science), optimal control (engineering), the reward system (neuroscience), conditioning (psychology), bounded rationality (economics), and operations research (mathematics).)

Branches of Machine Learning

(Figure: reinforcement learning is one of the three broad branches of machine learning, alongside supervised and unsupervised learning.)

Characteristics of Reinforcement Learning

What makes reinforcement learning different from other machine learning paradigms?

  • There is no supervisor, only a reward signal
  • Feedback is delayed, not instantaneous
  • Time really matters (sequential, non-i.i.d. data)
  • Agent’s actions affect the subsequent data it receives

Examples of Reinforcement Learning

  • Fly stunt manoeuvres in a helicopter
  • Defeat the world champion at Backgammon
  • Manage an investment portfolio
  • Control a power station
  • Make a humanoid robot walk
  • Play many different Atari games better than humans

Rewards

  • A reward R_t is a scalar feedback signal
  • Indicates how well the agent is doing at step t
  • The agent’s job is to maximise cumulative reward

Reinforcement learning is based on the reward hypothesis:

Definition (Reward Hypothesis)
All goals can be described by the maximisation of expected cumulative reward.

Do you agree with this statement?
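
To make "cumulative reward" concrete, here is a minimal Python sketch; the reward values and discount factor are made up for illustration (discounting is introduced formally with the value function later in this lecture):

```python
# Cumulative (discounted) reward for an illustrative reward sequence.
rewards = [1.0, 0.0, -0.5, 2.0]  # R_{t+1}, R_{t+2}, ... (made-up values)
gamma = 0.9                      # discount factor (illustrative)

G = sum(gamma**k * r for k, r in enumerate(rewards))
print(G)  # 1.0 + 0.9*0.0 + 0.81*(-0.5) + 0.729*2.0 = 2.053
```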

Examples of Rewards

  1. Fly stunt manoeuvres in a helicopter
    • +ve reward for following desired trajectory
    • −ve reward for crashing
  2. Defeat the world champion at Backgammon
    • +/−ve reward for winning/losing a game
  3. Manage an investment portfolio
    • +ve reward for each $ in bank
  4. Control a power station
    • +ve reward for producing power
    • −ve reward for exceeding safety thresholds
  5. Make a humanoid robot walk
    • +ve reward for forward motion
    • −ve reward for falling over
  6. Play many different Atari games better than humans
    • +/−ve reward for increasing/decreasing score

Sequential Decision Making

  • Goal: select actions to maximise total future reward
  • Actions may have long term consequences
  • Reward may be delayed
  • It may be better to sacrifice immediate reward to gain more long-term reward
  • Examples:
    • A financial investment (may take months to mature)
    • Refuelling a helicopter (might prevent a crash in several hours)
    • Blocking opponent moves (might help winning chances many moves from now)

Agent and Environment

  • At each step t the agent:
    • Executes action A_t
    • Receives observation O_t
    • Receives scalar reward R_t
  • The environment:
    • Receives action A_t
    • Emits observation O_{t+1}
    • Emits scalar reward R_{t+1}

History and State

  • The history is the sequence of observations, actions, rewards:
    H_t=O_1,R_1,A_1,...,A_{t-1},O_t,R_t
  • i.e. all observable variables up to time t
  • i.e. the sensorimotor stream of a robot or embodied agent
  • What happens next depends on the history:
    • The agent selects actions
    • The environment selects observations/rewards
  • State is the information used to determine what happens next
  • Formally, state is a function of the history:
    S_t=f(H_t)
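
A minimal sketch of S_t = f(H_t); the two choices of f below (last observation only, or the complete history) are illustrative, not prescribed by the lecture:

```python
# State is a function of history: S_t = f(H_t).
# History here is a list of (observation, action, reward) tuples.
history = [("o1", "a1", 0.0), ("o2", "a2", 1.0), ("o3", "a3", 0.5)]

def last_observation(h):
    """f(H_t) = the most recent observation only."""
    return h[-1][0]

def complete_history(h):
    """f(H_t) = the entire history (always sufficient, but grows with t)."""
    return tuple(h)

print(last_observation(history))  # 'o3'
```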

Environment State

  • The environment state S_t^e is the environment's private representation
  • i.e. whatever data the environment uses to pick the next observation/reward
  • The environment state is not usually visible to the agent
  • Even if S_t^e is visible, it may contain irrelevant information

Agent State

  • The agent state S_t^a is the agent's internal representation
  • i.e. whatever information the agent uses to pick the next action
  • It can be any function of history: S_t^a = f(H_t)

Information State

An information state (a.k.a. Markov state) contains all useful information from the history.

A state S_t is Markov if and only if
    P[S_{t+1}|S_t]=P[S_{t+1}|S_1,...,S_t]

  • “The future is independent of the past given the present”
    H_{1:t}\longrightarrow S_t\longrightarrow H_{t+1:\infty}
  • Once the state is known, the history may be thrown away
  • i.e. The state is a sufficient statistic of the future
  • The environment state S_t^e is Markov
  • The history H_t is Markov

Rat Example

(Figure: a rat observes a sequence of lights, bells, and levers, followed by either cheese or an electric shock. What it predicts next depends on its choice of state: the last 3 items in the sequence, counts of each item, or the complete sequence.)

Fully Observable Environments

  • Full observability: agent directly observes environment state:
    O_t = S_t^a = S_t^e
  • Agent state = environment state = information state
  • Formally this is a Markov decision process (MDP)

Partially Observable Environments

  • Partial observability: agent indirectly observes environment:
    • A robot with camera vision isn’t told its absolute location
    • A trading agent only observes current prices
    • A poker playing agent only observes public cards
  • Now agent state \neq environment state
  • Formally this is a partially observable Markov decision process (POMDP)
  • Agent must construct its own state representation S_t^a, e.g.
    • Complete history: S_t^a = H_t
    • Beliefs of environment state: S_t^a = (P[S_t^e = s^1],...,P[S_t^e = s^n])
    • Recurrent neural network: S_t^a = \sigma(S_{t-1}^aW_s + O_tW_o)
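
A minimal numpy sketch of the recurrent update S_t^a = \sigma(S_{t-1}^aW_s + O_tW_o) above; the dimensions, random weights, and stand-in observations are all illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

state_dim, obs_dim = 4, 3
rng = np.random.default_rng(0)
W_s = rng.normal(size=(state_dim, state_dim))  # recurrent weights (illustrative)
W_o = rng.normal(size=(obs_dim, state_dim))    # observation weights (illustrative)

s = np.zeros(state_dim)             # S_0^a
for _ in range(5):                  # one update per time step
    o = rng.normal(size=obs_dim)    # O_t (stand-in for a real observation)
    s = sigmoid(s @ W_s + o @ W_o)  # S_t^a = sigma(S_{t-1}^a W_s + O_t W_o)
print(s)
```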

Major Components of an RL Agent

  • An RL agent may include one or more of these components:
    • Policy: agent’s behaviour function
    • Value function: how good is each state and/or action
    • Model: agent’s representation of the environment

Policy

  • A policy is the agent’s behaviour
  • It is a map from state to action, e.g.
    • Deterministic policy: a = \pi(s)
    • Stochastic policy: \pi(a|s) = P[A_t = a|S_t = s]
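
A minimal sketch of the two kinds of policy; the tiny state and action sets are made up for illustration:

```python
import random

# Deterministic policy: a = pi(s), here just a lookup table.
pi_det = {"s1": "left", "s2": "right"}

# Stochastic policy: pi(a|s) = P[A_t = a | S_t = s].
pi_stoch = {"s1": {"left": 0.8, "right": 0.2},
            "s2": {"left": 0.1, "right": 0.9}}

def act(s):
    """Sample an action from the stochastic policy in state s."""
    actions, probs = zip(*pi_stoch[s].items())
    return random.choices(actions, weights=probs)[0]

print(pi_det["s1"], act("s1"))
```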

Value Function

  • A value function is a prediction of future reward
  • Used to evaluate the goodness/badness of states
  • And therefore to select between actions, e.g.
    v_{\pi}(s)=E_{\pi}[R_{t+1}+\gamma R_{t+2}+\gamma^2R_{t+3}+...|S_t=s]
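
A minimal Monte Carlo sketch of the expectation above: average the discounted return over sampled episodes. The `sample_episode` argument is a hypothetical stand-in for interacting with the environment under \pi starting from state s; it should return the reward sequence R_{t+1}, R_{t+2}, ...:

```python
def discounted_return(rewards, gamma=0.9):
    """G = R_{t+1} + gamma R_{t+2} + gamma^2 R_{t+3} + ..."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

def mc_value(sample_episode, s, n_episodes=1000, gamma=0.9):
    """Monte Carlo estimate of v_pi(s): the mean discounted return
    over episodes that start in state s and follow pi."""
    returns = [discounted_return(sample_episode(s), gamma)
               for _ in range(n_episodes)]
    return sum(returns) / len(returns)
```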

Model

  • A model predicts what the environment will do next
  • P predicts the next state
  • R predicts the next (immediate) reward, e.g.
    P_{ss'}^a=P[S_{t+1}=s'|S_t=s,A_t=a]
    R_s^a=E[R_{t+1}|S_t=s,A_t=a]
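
A minimal tabular sketch of the two parts of a model defined above; the transition probabilities and reward are made-up numbers:

```python
# Tabular model: P predicts the next state, R the next immediate reward.
# All numbers below are illustrative only.
P = {("s1", "a"): {"s1": 0.3, "s2": 0.7}}  # P_{ss'}^a = P[S_{t+1}=s'|S_t=s,A_t=a]
R = {("s1", "a"): 1.0}                     # R_s^a = E[R_{t+1}|S_t=s,A_t=a]

# The model answers "what happens next?" without touching the real environment:
next_state_dist = P[("s1", "a")]
expected_reward = R[("s1", "a")]
print(next_state_dist, expected_reward)
```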

Maze Example

(Figure: a maze with start and goal. Rewards: −1 per time-step. Actions: N, E, S, W. States: the agent's location.)

Maze Example: Policy

(Figure: arrows represent the policy \pi(s) for each state s.)

Maze Example: Value Function

(Figure: numbers represent the value v_{\pi}(s) of each state s.)

Maze Example: Model

(Figure: the agent's internal model of the maze: the grid layout represents the transition model P_{ss'}^a, numbers represent the immediate reward R_s^a from each state; the model may be imperfect.)

Categorizing RL agents (1)

  1. Value Based
    • No Policy (implicit)
    • Value Function
  2. Policy Based
    • Policy
    • No Value Function
  3. Actor Critic
    • Policy
    • Value Function

Categorizing RL agents (2)

  1. Model Free
    • Policy and/or Value Function
    • No Model
  2. Model Based
    • Policy and/or Value Function
    • Model

RL Agent Taxonomy

(Figure: a Venn diagram relating model-free and model-based agents to value-based, policy-based, and actor-critic agents.)

Learning and Planning

Two fundamental problems in sequential decision making

  1. Reinforcement Learning:

    • The environment is initially unknown
    • The agent interacts with the environment
    • The agent improves its policy
  2. Planning:

    • A model of the environment is known
    • The agent performs computations with its model (without any external interaction)
    • The agent improves its policy
    • a.k.a. deliberation, reasoning, introspection, pondering, thought, search

Atari Example: Reinforcement Learning

(Figure: the rules of the game are unknown; the agent learns directly from interactive game-play, picking actions on the joystick and seeing pixels and scores.)

Atari Example: Planning

(Figure: the rules of the game are known; the agent can query an emulator as a perfect model, planning ahead, e.g. by tree search, without external interaction.)

Exploration and Exploitation (1)

  • Reinforcement learning is like trial-and-error learning
  • The agent should discover a good policy
  • From its experiences of the environment
  • Without losing too much reward along the way

Exploration and Exploitation (2)

  • Exploration finds more information about the environment
  • Exploitation exploits known information to maximise reward
  • It is usually important to explore as well as exploit

Examples

  • Restaurant Selection
    Exploitation: Go to your favourite restaurant
    Exploration: Try a new restaurant

  • Online Banner Advertisements
    Exploitation: Show the most successful advert
    Exploration: Show a different advert

  • Oil Drilling
    Exploitation: Drill at the best known location
    Exploration: Drill at a new location

  • Game Playing
    Exploitation: Play the move you believe is best
    Exploration: Play an experimental move
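
One standard way to balance the two, consistent with the examples above, is ε-greedy action selection; a minimal sketch, with made-up value estimates Q:

```python
import random

def epsilon_greedy(Q, epsilon=0.1):
    """With probability 1 - epsilon exploit the action with the highest
    estimated value; with probability epsilon explore a random action."""
    if random.random() < epsilon:
        return random.choice(list(Q))  # explore
    return max(Q, key=Q.get)           # exploit

Q = {"favourite restaurant": 0.9, "new restaurant": 0.5}  # illustrative values
print(epsilon_greedy(Q))
```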

Prediction and Control

  • Prediction: evaluate the future
    • Given a policy
  • Control: optimise the future
    • Find the best policy

Gridworld Example: Prediction

(Figure: a small gridworld; the values v_{\pi}(s) of each state under a uniform random policy.)

Gridworld Example: Control

(Figure: the same gridworld; the optimal values v_*(s) and an optimal policy \pi_*.)

Course Outline

  • Part I: Elementary Reinforcement Learning
    1. Introduction to RL
    2. Markov Decision Processes
    3. Planning by Dynamic Programming
    4. Model-Free Prediction
    5. Model-Free Control
  • Part II: Reinforcement Learning in Practice
    1. Value Function Approximation
    2. Policy Gradient Methods
    3. Integrating Learning and Planning
    4. Exploration and Exploitation
    5. Case study - RL in games

Reference: David Silver, UCL Course on RL, Lecture 1.
