Chapter 5: Monte Carlo Methods

Monte Carlo (MC) methods are learning methods for estimating value functions and discovering optimal policies by averaging complete returns of sample experience. Some high-level characteristics include:

Unlike DP, MC does not require the model dynamics. Instead, it learns from (simulated) experience. This is valuable because in many cases it is easy to generate sample trajectories but infeasible to obtain the model dynamics in explicit form.
MC needs complete returns, so it is only suitable for episodic tasks. It can be incremental in an episode-by-episode sense, but not in a step-by-step (online) sense.
MC does not bootstrap, so its (first-visit) estimates are unbiased. Another advantage is that the estimate for one state does not depend on the estimates for other states, so we can estimate the values of only the states we are interested in, regardless of the size of the state space.
Following the general framework of GPI, MC also consists of a value estimation (prediction) phase and a policy improvement (control) phase.

Monte Carlo Prediction

This phase learns the value functions for a given policy.
For estimating the state value $v_\pi(s)$, it simply averages the returns observed after visits to $s$ across sample episodes. There are two variants: first-visit MC averages only the returns following the first visit to $s$ in each episode, while every-visit MC averages the returns following all visits to $s$. Both are guaranteed to converge to $v_\pi(s)$ as the number of visits to $s$ goes to infinity.
As we cannot derive the optimal policy from state values without knowing the model dynamics, it is more important to estimate state-action values $q_\pi(s, a)$ in MC. An important problem here is how to maintain exploration so that every state-action pair is visited, which is needed for value estimation and therefore for finding the optimal policy. This is discussed in the next part.
The backup diagram of MC is a single line of the sampled trajectory.
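The first-visit averaging described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `first_visit_mc_prediction` and the episode format (a list of `(state, reward)` pairs, where the reward is the one received after leaving that state) are assumptions made for this example.

```python
from collections import defaultdict

def first_visit_mc_prediction(episodes, gamma=1.0):
    """Estimate v_pi(s) by averaging first-visit returns.

    episodes: iterable of episodes, each a list of (state, reward) pairs
    (hypothetical format chosen for this sketch).
    """
    returns_sum = defaultdict(float)
    visit_count = defaultdict(int)
    for episode in episodes:
        # Record the first time step at which each state appears.
        first_visit = {}
        for t, (state, _) in enumerate(episode):
            first_visit.setdefault(state, t)
        g = 0.0
        # Walk backwards so that g accumulates the discounted return G_t.
        for t in range(len(episode) - 1, -1, -1):
            state, reward = episode[t]
            g = gamma * g + reward
            if first_visit[state] == t:
                returns_sum[state] += g
                visit_count[state] += 1
    return {s: returns_sum[s] / visit_count[s] for s in returns_sum}
```

Dropping the `first_visit` check would turn this into every-visit MC, since every occurrence of a state would then contribute its return to the average.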

Monte Carlo Control

MC is guaranteed to converge to the optimal policy under the following two assumptions:

Each state-action pair has a non-zero probability of being visited;
The policy evaluation can be done with an infinite number of episodes.

However, these two assumptions are usually unrealistic in real-world problems. A way to relax the second assumption is to approximate the evaluation with a finite number of episodes, at the cost of higher variance in the value estimates.
Theoretically, a way to satisfy the first assumption is exploring starts, which ensures that every state-action pair has a non-zero probability of being selected as the start of an episode. However, this is infeasible in most cases, as we cannot start episodes from arbitrary state-action pairs. The practical solutions to this problem are on-policy methods and off-policy methods.

On-policy Control

On-policy methods directly improve the policy that is used to generate the data. They ensure exploration by learning an $\epsilon$-greedy policy instead of a deterministic policy. The cost is that we can only obtain the best policy among $\epsilon$-soft policies instead of the truly optimal policy.
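As a small illustration of the $\epsilon$-greedy policy mentioned above, the sketch below computes the action probabilities for a single state and samples from them. The names `epsilon_greedy_probs` and `sample_action`, and the representation of action values as a plain list, are assumptions made for this example.

```python
import random

def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of the epsilon-greedy policy for one state.

    Every action gets probability epsilon / n; the greedy action gets
    the remaining 1 - epsilon on top of that.
    """
    n = len(q_values)
    greedy = max(range(n), key=lambda a: q_values[a])  # ties go to the lowest index
    probs = [epsilon / n] * n
    probs[greedy] += 1.0 - epsilon
    return probs

def sample_action(q_values, epsilon, rng=random):
    """Draw one action index from the epsilon-greedy distribution."""
    probs = epsilon_greedy_probs(q_values, epsilon)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]
```

Because every action keeps probability at least $\epsilon / n$, the policy stays soft, which is what guarantees continued exploration.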

Off-policy Control

Off-policy methods learn an optimal target policy $\pi$ that is different from the behavior policy $b$ used to generate sample trajectories. We can use a soft behavior policy to ensure exploration while still learning a deterministic target policy.
The key challenge is that we cannot directly average the returns generated by the behavior policy, because they reflect the value function of $b$ rather than that of the target policy: under $b$, $\mathbb{E}[G_t | S_t=s] = v_b(s)$, which in general differs from $v_\pi(s)$.
To solve this problem, we introduce importance sampling, which reweights the samples from the behavior policy to match the sample distribution under the target policy. The transition probabilities cancel, so the importance sampling ratio depends only on the two policies: $\rho_{t:T-1} = \frac{\prod_{k=t}^{T-1} \pi(A_k|S_k)p(S_{k+1}|S_k, A_k)}{\prod_{k=t}^{T-1} b(A_k|S_k)p(S_{k+1}|S_k, A_k)} = \prod_{k=t}^{T-1} \frac{\pi(A_k|S_k)}{b(A_k|S_k)}.$ Then, with the expectation taken over trajectories generated by $b$, we have $\mathbb{E}[\rho_{t:T-1} G_t | S_t = s] = v_\pi(s).$
There are two ways to estimate $v_\pi(s)$ from returns $G_1, \dots, G_{n-1}$ that all start in $s$, where we write $W_k$ for the importance sampling ratio of the $k$-th return for simplicity. The first is ordinary importance sampling, $V_n = \frac{\sum_{k=1}^{n-1} W_k G_k}{n-1}$, and the second is weighted importance sampling, $V_n = \frac{\sum_{k=1}^{n-1} W_k G_k}{\sum_{k=1}^{n-1} W_k}$. Ordinary importance sampling is unbiased, but its variance is much larger and can even be infinite. Weighted importance sampling is biased, but its bias shrinks toward zero as the number of samples increases and its variance stays bounded when the returns are bounded, so it is strongly preferred in practice.
Moreover, the weighted estimator can be computed incrementally as $V_{n+1} = V_n + \frac{W_n}{C_n} \big [ G_n - V_n \big ]$, where $C_{n+1}=C_n + W_{n+1}$ and $C_0 = 0$, so that $C_n$ is the cumulative sum of the first $n$ weights.
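To make the difference between the estimators concrete, here is a minimal sketch of the ordinary, weighted, and incremental weighted forms. The function names are hypothetical, and the inputs are assumed to be parallel lists of importance sampling ratios $W_k$ and returns $G_k$ observed from the same state.

```python
def ordinary_is(weights, returns):
    """Ordinary importance sampling: divide by the number of returns."""
    return sum(w * g for w, g in zip(weights, returns)) / len(returns)

def weighted_is(weights, returns):
    """Weighted importance sampling: divide by the sum of the weights."""
    total = sum(weights)
    if total == 0:
        return 0.0  # convention for this sketch when all ratios are zero
    return sum(w * g for w, g in zip(weights, returns)) / total

def incremental_weighted_is(weights, returns):
    """Same estimate as weighted_is, computed one return at a time."""
    v, c = 0.0, 0.0  # running estimate and cumulative weight
    for w, g in zip(weights, returns):
        if w == 0:
            continue  # a zero-weight return leaves the estimate unchanged
        c += w
        v += (w / c) * (g - v)
    return v

weights = [2.0, 0.0, 0.5]
returns = [1.0, 5.0, 3.0]
print(ordinary_is(weights, returns))
print(weighted_is(weights, returns))
print(incremental_weighted_is(weights, returns))
```

In this toy input the ordinary estimate is $3.5 / 3$ while both weighted forms give $3.5 / 2.5 = 1.4$; the weighted estimate always stays within the range of the observed returns, whereas the ordinary one can leave it when ratios are large.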
Off-policy MC control is shown in the figure below. A problem is highlighted in its last two rows: when the target policy is deterministic, the importance sampling ratio becomes zero as soon as the behavior policy takes an action that is not greedy under the current target policy. When nongreedy actions are common, the method learns only from the tails of episodes, which greatly slows down learning.

Off-policy MC control.
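The sketch below shows one common formulation of off-policy MC control with weighted importance sampling and a greedy target policy. It is an assumption about the general shape of the algorithm rather than a transcription of the figure, and the function name `off_policy_mc_update` and the episode format `(state, action, reward, b_prob)` are hypothetical. It processes a single episode backwards and stops as soon as the behavior action disagrees with the greedy target action.

```python
from collections import defaultdict

def off_policy_mc_update(episode, q, c, n_actions, gamma=1.0):
    """Update Q from one episode generated by a soft behavior policy b.

    episode: list of (state, action, reward, b_prob) tuples, where
             b_prob = b(action | state) (hypothetical format).
    q, c:    dicts mapping (state, action) to the value estimate and
             the cumulative importance weight.
    Returns how many time steps, counted from the end, were used.
    """
    g, w = 0.0, 1.0
    used = 0
    for state, action, reward, b_prob in reversed(episode):
        g = gamma * g + reward
        c[(state, action)] += w
        q[(state, action)] += (w / c[(state, action)]) * (g - q[(state, action)])
        used += 1
        # The target policy is greedy with respect to the current Q.
        greedy = max(range(n_actions), key=lambda a: q[(state, a)])
        if action != greedy:
            break  # pi(action | state) = 0, so the ratio is zero for all earlier steps
        w *= 1.0 / b_prob  # pi(action | state) = 1 for the greedy action
    return used

q = defaultdict(float)
c = defaultdict(float)
steps_used = off_policy_mc_update(
    [("s0", 1, 0.0, 0.5), ("s1", 0, 1.0, 0.5)], q, c, n_actions=2
)
```

The early `break` is where the slow-learning problem shows up: every nongreedy action cuts off the part of the episode that precedes it, so only the tail contributes to the update.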

Approximations in MC

Truncated policy evaluation using only a finite number of sample episodes.
Policy evaluation that reuses returns from earlier trajectories sampled under different policies.

Under these approximations, is it still wise to fully trust the estimated values to determine a greedy policy improvement?
