Lecture 1: Introduction to Reinforcement Learning

book: Reinforcement Learning: An Introduction. Sutton and Barto, 1998

book: Algorithms for Reinforcement Learning. Szepesvari

About Reinforcement Learning

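Reinforcement learning sits at the intersection of many disciplines. At its core it is arguably the science of decision making: the aim is to find the best way to make decisions.

In engineering, a great deal of effort goes into the same problem under the name of optimal control.

Reinforcement learning goes by different names in different fields, but it always comes down to carrying out a sequence of actions in order to eventually reach a good outcome.

In neuroscience, one of the major recent findings concerns how the human brain makes decisions: it relies to a large extent on the dopamine system, and the signalling of the neurotransmitter dopamine closely mirrors the main algorithms studied in this course.

The same ideas also appear in psychology, mathematics (equivalent formulations, operations research), and economics (game theory).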

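Reinforcement learning differs from supervised learning, but it is not unsupervised learning either.

What makes it different:

There is no supervisor, only a reward signal. No one tells us the right action to take; instead, learning resembles a child discovering things by trial and error. Nothing tells you which action is best; you may only be told that an action was wrong or right, or be given a score.
Feedback is delayed, not instantaneous. After making a decision, you may only find out many steps later whether it was a good or a bad one. Only with the passage of time, looking back on past decisions, do you realise that a choice may have been a mistake, because a decision can lead to a catastrophic loss many steps after it was made.
Time really matters (sequential, non-i.i.d. data). In sequential decision making the agent acts step by step and observes how much reward each action yields. It then tries to adjust its policy so that it ends up collecting as much reward as possible.

We are not talking about classical supervised or unsupervised learning, where you simply hand the machine i.i.d. data and let it learn. Here we are dealing with a dynamic system in which the agent interacts with an external environment. The i.i.d. assumption is broken: the agent chooses its actions in response to the environment, and every decision it makes affects the data it subsequently receives. This is an active learning process, and the data the agent learns from varies with its own behaviour rather than coming from a single fixed data set.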

Rewards

A reward $R_t$ is a scalar feedback signal
Indicates how well the agent is doing at step t
The agent's job is to maximise cumulative reward.

Definition (Reward Hypothesis)

All goals can be described by the maximisation of expected cumulative reward.

Sequential Decision Making

Goal: select actions to maximise total future reward.
Actions may have long term consequences
Reward may be delayed
It may be better to sacrifice immediate reward to gain more long-term reward.
Examples:
- A financial investment.
- Refuelling a helicopter
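The trade-off between immediate and long-term reward can be illustrated with a minimal Python sketch. The two reward sequences below are made-up numbers for a hypothetical investment decision, not data from the lecture.

```python
def total_reward(rewards):
    """Total (undiscounted) reward collected along one trajectory."""
    return sum(rewards)


# Hypothetical reward sequences, one entry per time step.
cash_out_now = [5, 0, 0, 0]     # take a small reward immediately
invest_first = [-2, 0, 0, 10]   # pay a cost now, collect more later

# The goal is total future reward, so the option with the worse first step wins.
best = max([cash_out_now, invest_first], key=total_reward)
```

Here `invest_first` totals 8 against 5 for `cash_out_now`, even though its first reward is negative.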

History and State

The history is the sequence of observations, actions, rewards
$H_t = A_1, O_1, R_1,...,A_t, O_t, R_t$
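The history is everything the agent knows so far: at every step it takes an action, makes an observation, and receives a reward.
i.e. all observable variables up to time t.
The picture is modelled on the human brain: the agent perceives what it "sees" through its sensors, so its input is what it observes and its output is the decision it makes. The agent and the environment need a well-defined interface, and our job is to control what passes across that interface.
What happens next depends on the history:
- The agent selects actions (the algorithm we build is really a mapping from history to action; the agent's next action depends entirely on the history)
- The environment selects observations/rewards (the environment changes according to the history)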
State is the information used to determine what happens next
Formally, state is a function of the history:
$S_t = f(H_t)$
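As a rough illustration of "state is a function of the history", the sketch below stores a history as a list of (action, observation, reward) steps and defines three candidate functions $f$. The `Step` type and the three function names are assumptions made for this example only.

```python
from typing import List, NamedTuple


class Step(NamedTuple):
    """One time step of the history: A_t, O_t, R_t."""
    action: str
    observation: int
    reward: float


def last_observation_state(history: List[Step]) -> int:
    """f(H_t) that keeps only the most recent observation."""
    return history[-1].observation


def reward_so_far_state(history: List[Step]) -> float:
    """f(H_t) that summarises the history by the reward collected so far."""
    return sum(step.reward for step in history)


def full_history_state(history: List[Step]) -> tuple:
    """f(H_t) that keeps everything: the state is the history itself."""
    return tuple(history)


history = [Step("N", 3, -1.0), Step("E", 4, -1.0), Step("E", 5, -1.0)]
```

Each function is a valid state definition; they differ in how much of the history they retain.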

Information State

An information state (a.k.a. Markov state) contains all useful information from the history.

Definition

A state $S_t$ is Markov if and only if
$\mathbb{P}[S_{t+1}|S_t] = \mathbb{P}[S_{t+1}|S_1,...,S_t]$
In other words, once the current state is known, all previous states can be thrown away and only the current state needs to be kept. The state at every later step is Markov as well.

The future is independent of the past given the present
$H_{1:t}\rightarrow S_t \rightarrow H_{t+1:\infty }$
In other words, the state can stand in for the whole history, so the history can be discarded.

Example

What we believe will happen next depends on our state representation.

Fully Observable Environments

**Full observability:** agent directly observes environment state
$O_t = S_t^a = S_t^e$

Agent state = Environment state = Information state
Formally, this is a Markov decision process (MDP)

Partially Observable Environments

Partial observability: agent indirectly observes environment.
- A robot with camera vision isn't told its absolute location
- A trading agent only observes current prices
- A poker playing agent only observes public cards
Now agent state $\neq$ environment state
Formally, this is a partially observable Markov decision process (POMDP)
Agent must construct its own state representation $S_t^a$, e.g.
- Complete history: $S_t^a = H_t$
- Beliefs of environment state: $S_t^a = (\mathbb{P}[S_t^e = s^1],...,\mathbb{P}[S_t^e = s^n])$
- Recurrent neural network: $S_t^a = \sigma(S_{t-1}^a W_s + O_t W_o)$
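The recurrent option can be sketched in a few lines of plain Python: the new agent state is a nonlinearity applied to a linear combination of the previous agent state and the current observation. This is only an illustrative sketch; the logistic sigmoid as the nonlinearity, the weight names `W_s` and `W_o`, and the tiny dimensions are all assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def vec_mat(vec, mat):
    """Row vector (length n) times matrix (n x m), giving a list of length m."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


def update_state(prev_state, obs, W_s, W_o):
    """One recurrent step: combine previous agent state and new observation."""
    recurrent = vec_mat(prev_state, W_s)
    inputs = vec_mat(obs, W_o)
    return [sigmoid(r + i) for r, i in zip(recurrent, inputs)]


# Hypothetical sizes: 2-dimensional agent state, 1-dimensional observation.
W_s = [[0.0, 0.0], [0.0, 0.0]]
W_o = [[0.0, 0.0]]
state = update_state([0.0, 0.0], [1.0], W_s, W_o)
```

With all weights set to zero the update returns 0.5 in every component; in practice the weights would be learned.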

Inside An RL Agent

An RL agent may include one or more of these components:
- Policy: agent's behaviour function.
- Value function: how good is each state and/or action
- Model: agent's representation of the environment (how the environment looks from the agent's point of view)

Policy

A policy is the agent's behaviour
It is a map from state to action, e.g.
Deterministic policy: $a = \pi(s)$. The function maps the current state $s$ to an action, which is the action the agent will take.
Stochastic policy: $\pi(a|s) = \mathbb{P}[A=a|S=s]$
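The two kinds of policy can be written down as lookup tables. The sketch below is a minimal illustration; the state names, action names, and probabilities are invented for the example.

```python
import random

# Deterministic policy: each state maps to exactly one action, a = pi(s).
deterministic_policy = {"s0": "right", "s1": "left"}

# Stochastic policy: each state maps to a distribution over actions, pi(a|s).
stochastic_policy = {
    "s0": {"left": 0.2, "right": 0.8},
    "s1": {"left": 0.5, "right": 0.5},
}


def act_deterministic(state):
    return deterministic_policy[state]


def act_stochastic(state, rng):
    """Sample an action according to pi(.|state)."""
    actions, probs = zip(*stochastic_policy[state].items())
    return rng.choices(actions, weights=probs, k=1)[0]


rng = random.Random(0)  # seeded only so that runs are repeatable
sampled = act_stochastic("s0", rng)
```

Calling `act_deterministic("s0")` always returns the same action, while repeated calls to `act_stochastic("s0", rng)` return "right" about 80% of the time.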

Value Function

Value function is a prediction of future reward.
Used to evaluate the goodness/badness of states
And therefore to select between actions, e.g.
$V_{\pi}(s) = \mathbb{E}_{\pi}[R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + ... | S_t = s]$
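The quantity inside the expectation is a discounted sum of rewards. A minimal sketch of how it could be computed for one observed reward sequence, and how averaging several such sums gives a simple sample estimate of the value, is shown below. The reward sequences are made up, and averaging sampled returns is just one way to estimate the expectation.

```python
def discounted_return(rewards, gamma):
    """R_t + gamma * R_{t+1} + gamma^2 * R_{t+2} + ... for one reward sequence."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def estimate_value(episodes, gamma):
    """Average discounted return over sampled episodes that start in the same state."""
    returns = [discounted_return(rewards, gamma) for rewards in episodes]
    return sum(returns) / len(returns)


# Hypothetical reward sequences observed from the same start state.
episodes = [[1, 1, 1], [0, 0, 4]]
value_estimate = estimate_value(episodes, gamma=0.5)
```

With `gamma = 0.5` the first sequence has return 1.75 and the second 1.0, so the estimate is 1.375.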

Model

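A model predicts what the environment will do next. (The model is not the environment itself, but it is very useful for predicting how the environment changes: the agent learns the environment's dynamics and can then use the model to plan its next actions.)
Transitions: $\mathcal{P}$ predicts the next state (i.e. dynamics)
Rewards: $\mathcal{R}$ predicts the next (immediate) reward, e.g.
$\mathcal{P}_{ss'}^a = \mathbb{P}[S' = s' | S = s, A = a]$
$\mathcal{R}_s^a = \mathbb{E}[R | S = s, A = a]$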
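One simple way an agent might learn such a model is by counting: estimate the transition probabilities from how often each next state followed a (state, action) pair, and the reward as the average reward seen for that pair. The sketch below assumes a small tabular setting and experience stored as `(s, a, r, s_next)` tuples; both are assumptions for illustration.

```python
from collections import defaultdict


def fit_model(transitions):
    """Estimate P[(s, a)][s_next] and R[(s, a)] from (s, a, r, s_next) samples."""
    counts = defaultdict(lambda: defaultdict(int))
    reward_sum = defaultdict(float)
    visits = defaultdict(int)
    for s, a, r, s_next in transitions:
        counts[(s, a)][s_next] += 1
        reward_sum[(s, a)] += r
        visits[(s, a)] += 1
    # Relative frequencies approximate the transition probabilities.
    P = {sa: {s2: c / visits[sa] for s2, c in nxt.items()} for sa, nxt in counts.items()}
    # Sample means approximate the expected immediate reward.
    R = {sa: reward_sum[sa] / visits[sa] for sa in visits}
    return P, R


# Hypothetical experience collected by the agent.
experience = [("A", "go", -1, "B"), ("A", "go", -1, "B"), ("A", "go", -1, "A"), ("B", "go", 0, "B")]
P, R = fit_model(experience)
```

For this data the estimated probability of moving from A to B under "go" is 2/3, and the estimated reward for that pair is -1.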

Maze Example

Reward: -1 per time step
Actions: N, E, S, W
States: Agent's location

Policy

Mapping from state to action: every state (every cell the agent can occupy) has an arrow showing which direction the agent will move next from that state.

Value Function

...

Model

Every step yields an immediate reward of -1.
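The three components can be made concrete on a toy maze. The layout below is a small made-up grid, not the maze from the lecture slides. `step` plays the role of the model (next state and a reward of -1), the value of a cell under the optimal policy is minus the number of steps to the goal, and the policy points each cell toward the neighbour with the highest value.

```python
from collections import deque

# Hypothetical 3x3 maze: '#' is a wall, 'G' the goal, everything else is open.
MAZE = ["S.#",
        "..#",
        "#.G"]
MOVES = {"N": (-1, 0), "E": (0, 1), "S": (1, 0), "W": (0, -1)}
GOAL = (2, 2)


def step(state, action):
    """Model: bumping into a wall or the edge leaves the agent in place; reward is -1."""
    r, c = state
    dr, dc = MOVES[action]
    nr, nc = r + dr, c + dc
    if 0 <= nr < len(MAZE) and 0 <= nc < len(MAZE[0]) and MAZE[nr][nc] != "#":
        return (nr, nc), -1
    return state, -1


def optimal_values():
    """Breadth-first search outward from the goal: value = -(steps to goal)."""
    values = {GOAL: 0}
    queue = deque([GOAL])
    while queue:
        s = queue.popleft()
        for a in MOVES:
            n, _ = step(s, a)
            if n not in values:
                values[n] = values[s] - 1
                queue.append(n)
    return values


def greedy_policy(values):
    """Policy: in each non-goal cell, move toward the highest-valued neighbour."""
    return {s: max(MOVES, key=lambda a: values[step(s, a)[0]])
            for s in values if s != GOAL}


values = optimal_values()
policy = greedy_policy(values)
```

The start cell `(0, 0)` gets value -4 because the goal is four steps away, and the cell left of the goal gets value -1 with policy "E".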

Categorizing RL agents (1)

Value Based
- No Policy(Implicit)
- Value Function
Policy Based
- Policy
- No Value Function
Actor Critic
- Policy
- Value Function

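Categorizing RL agents (2)

Model free means we do not try to understand the environment: we do not build a model of its dynamics.

Instead we work directly with a policy and/or a value function, which is enough to tell us how to act to obtain the most reward. We do not need to know how the environment's state changes.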

Model Free
- Policy and/or Value Function
- No Model

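By contrast, we can build model-based reinforcement learning agents. The first step is to build a model of how the environment works, for example a dynamics model of a helicopter. With this model we can predict what will happen next and look ahead to find the best way to act.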

RL Agent Taxonomy

....

Problems within Reinforcement Learning

Two fundamental problems in sequential decision making

Reinforcement learning
- The environment is initially unknown
- The agent interacts with environment
- The agent improves its policy
Planning
- A model of the environment is known
- The agent performs computations with its model(without any external interaction)
- The agent improves its policy
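To make the planning case concrete, the sketch below runs value iteration on a tiny known model without ever interacting with an environment. Value iteration is one standard planning method and is used here only as an illustration; the two-state MDP, its rewards, and the assumption that every action is available in every state are all made up for the example.

```python
def q_value(P, R, V, s, a, gamma):
    """Expected return of taking action a in state s, then following values V."""
    return R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())


def value_iteration(P, R, gamma, sweeps=100):
    """Plan with a known model: repeatedly back up values, then read off a policy."""
    states = sorted({s for s, _ in P})
    actions = sorted({a for _, a in P})
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        V = {s: max(q_value(P, R, V, s, a, gamma) for a in actions) for s in states}
    policy = {s: max(actions, key=lambda a: q_value(P, R, V, s, a, gamma)) for s in states}
    return V, policy


# Hypothetical known model: deterministic transitions between states A and B.
P = {("A", "stay"): {"A": 1.0}, ("A", "go"): {"B": 1.0},
     ("B", "stay"): {"B": 1.0}, ("B", "go"): {"A": 1.0}}
R = {("A", "stay"): 1.0, ("A", "go"): 0.0,
     ("B", "stay"): 3.0, ("B", "go"): 0.0}

V, policy = value_iteration(P, R, gamma=0.5)
```

The resulting plan leaves A for B and then stays in B, giving up the immediate reward in A for the larger reward available in B.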

Exploration and Exploitation(1)

Exploration and exploitation: reinforcement learning is learning by continual trial and error.

Exploration and Exploitation(2)

Exploration finds more information about the environment
Exploitation exploits known information to maximise reward
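A common way to balance the two is an epsilon-greedy rule, sketched below: with probability `epsilon` the agent explores by picking a random action, otherwise it exploits its current value estimates. This rule is a standard technique chosen here for illustration; the action names and estimates are invented.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the best-known one."""
    if rng.random() < epsilon:
        return rng.choice(list(q_values))      # explore
    return max(q_values, key=q_values.get)     # exploit


# Hypothetical value estimates for three restaurants.
q_values = {"favourite": 8.0, "new_place": 5.0, "diner": 6.5}
rng = random.Random(0)  # seeded only so that runs are repeatable

always_exploit = epsilon_greedy(q_values, epsilon=0.0, rng=rng)
always_explore = epsilon_greedy(q_values, epsilon=1.0, rng=rng)
```

With `epsilon=0.0` the choice is always "favourite"; with `epsilon=1.0` every action is equally likely.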

Prediction and Control

Prediction: evaluate the future
- Given a policy
Control: optimise the future
- Find the best policy
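The difference between the two problems can be sketched on a tiny known model. `evaluate` answers the prediction question for a given policy by iterating its expected-return equation. `best_policy` answers the control question by brute force, evaluating every deterministic policy and keeping the best. The two-state MDP and the brute-force search are assumptions made for illustration; real control methods do not enumerate policies.

```python
from itertools import product


def evaluate(policy, P, R, gamma, sweeps=200):
    """Prediction: value of each state when following the given policy."""
    V = {s: 0.0 for s in policy}
    for _ in range(sweeps):
        V = {s: R[(s, policy[s])]
                + gamma * sum(p * V[s2] for s2, p in P[(s, policy[s])].items())
             for s in policy}
    return V


def best_policy(states, actions, P, R, gamma):
    """Control: search all deterministic policies for the one with the highest values."""
    candidates = [dict(zip(states, choice)) for choice in product(actions, repeat=len(states))]
    return max(candidates, key=lambda pi: sum(evaluate(pi, P, R, gamma).values()))


# Hypothetical model: deterministic transitions between states A and B.
P = {("A", "stay"): {"A": 1.0}, ("A", "go"): {"B": 1.0},
     ("B", "stay"): {"B": 1.0}, ("B", "go"): {"A": 1.0}}
R = {("A", "stay"): 1.0, ("A", "go"): 0.0,
     ("B", "stay"): 3.0, ("B", "go"): 0.0}

v_stay = evaluate({"A": "stay", "B": "stay"}, P, R, gamma=0.5)
optimal = best_policy(["A", "B"], ["stay", "go"], P, R, gamma=0.5)
```

Under the "always stay" policy state A is worth 2 and state B is worth 6, while the control search finds that leaving A for B is better.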
