Chapter 5: Monte Carlo Methods

Monte Carlo (MC) methods are learning methods for estimating value functions and discovering optimal policies by averaging complete returns of sample experience. Some high-level characteristics include:

  1. Unlike DP, MC does not require the model dynamics. Instead, it learns from (simulated) experience. This is valuable because in many cases it is easy to generate sample trajectories but infeasible to obtain the model dynamics in explicit form.
  2. MC needs complete returns, so it is only suitable for episodic tasks. It can be incremental in an episode-by-episode sense, but not in a step-by-step (online) sense.
  3. MC does not bootstrap, so it avoids the bias that bootstrapping introduces. Another advantage is that we can estimate the values of only the states we are interested in, regardless of the size of the state space.

Following the general framework of GPI, MC also consists of a value estimation (prediction) phase and a policy improvement (control) phase.

Monte Carlo Prediction

This phase learns the value functions for a given policy.
For estimating state value v_\pi(s), it works by simply averaging the returns of s in different sample episodes. It can be further categorized as first-visit MC and every-visit MC, both guaranteed to converge as the number of visits to s goes to infinity.
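As a minimal sketch of first-visit MC prediction, the function below averages, for each state, the return that follows its first occurrence in every episode. The episode format (a list of `(state, reward)` pairs, where the reward is the one received after leaving that state) and the function name are assumptions made for illustration, not something fixed by the text.

```python
from collections import defaultdict


def first_visit_mc_prediction(episodes, gamma=1.0):
    """Estimate v_pi(s) by averaging first-visit returns.

    Each episode is a list of (state, reward) pairs, i.e. (S_t, R_{t+1}).
    This data layout is an assumption made for this sketch.
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        # Record the time step at which each state is first visited.
        first_visit = {}
        for t, (state, _) in enumerate(episode):
            first_visit.setdefault(state, t)
        # Walk backwards so the return G_t can be accumulated in one pass.
        g = 0.0
        for t in range(len(episode) - 1, -1, -1):
            state, reward = episode[t]
            g = gamma * g + reward
            if first_visit[state] == t:
                returns_sum[state] += g
                returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}


# Two hand-written episodes; "A" is visited twice in the first one,
# but only its first visit contributes a return.
episodes = [
    [("A", 0.0), ("B", 1.0), ("A", 2.0)],
    [("A", 1.0)],
]
V = first_visit_mc_prediction(episodes)
print(V["A"], V["B"])  # 2.0 3.0
```

An every-visit variant would drop the `first_visit` check and add every occurrence of a state to the running sums.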
As we cannot derive the optimal policy from state values without knowing the model dynamics, it is more important to estimate state-action values q(s, a) in MC. An important problem here is how to maintain exploration so that every state-action pair is visited, which value estimation needs in order to find the optimal policy. This will be discussed in the next part.
The backup diagram of MC is a single line of the sampled trajectory.

Monte Carlo Control

MC is guaranteed to converge to the optimal policy under the following two assumptions:

  1. Each state-action pair has a non-zero probability of being visited;
  2. The policy evaluation can be done with an infinite number of episodes.

However, these two assumptions are usually unrealistic in real-world problems. One way to relax the second assumption is to approximate policy evaluation with a finite number of episodes, at the cost of higher variance.
Theoretically, a solution to the first assumption is exploring starts, which ensures that every state-action pair has a non-zero probability of being selected at the start of an episode. However, this is infeasible in most cases, as we cannot generate arbitrary trajectories as we like. The solutions to this problem include on-policy methods and off-policy methods.

On-policy Control

On-policy methods directly improve the policy that is used to generate data. They ensure exploration by learning an \epsilon-greedy policy instead of a deterministic policy. The cost is that we can only obtain the best policy among \epsilon-soft policies instead of the truly optimal policy.
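To make the \epsilon-greedy policy concrete, here is a small sketch of how its action probabilities can be built from the current action-value estimates of one state. The function names and the list-based representation of q-values are assumptions for illustration. Every action keeps a probability of at least \epsilon / |A(s)|, which is what keeps the policy soft.

```python
import random


def epsilon_greedy_probs(q_values, epsilon):
    """Return action probabilities of an epsilon-greedy policy for one state.

    q_values is a list of action-value estimates (hypothetical layout).
    Ties are broken in favour of the lowest action index.
    """
    n = len(q_values)
    greedy = max(range(n), key=lambda a: q_values[a])
    # Every action receives epsilon / n; the greedy one gets the rest.
    probs = [epsilon / n] * n
    probs[greedy] += 1.0 - epsilon
    return probs


def sample_action(q_values, epsilon, rng):
    """Draw one action from the epsilon-greedy distribution."""
    probs = epsilon_greedy_probs(q_values, epsilon)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


probs = epsilon_greedy_probs([0.1, 0.5, 0.2], epsilon=0.3)
action = sample_action([0.1, 0.5, 0.2], epsilon=0.3, rng=random.Random(0))
```

In an on-policy control loop, these probabilities would be recomputed for each visited state after its action values are updated from the latest episode.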

Off-policy Control

Off-policy methods learn an optimal target policy \pi that is different from the behavior policy b used to generate sample trajectories. We can use a soft behavior policy to ensure exploration while still learning a deterministic target policy.
The key challenge here is that we cannot directly average the returns from the behavior policy, as they reflect the value function of the behavior policy rather than that of the target policy, i.e., \mathbb{E}[G_t | S_t=s] = v_b(s).
To solve this problem, we introduce importance sampling, which reweights the samples from the behavior policy to match the sample distribution under the target policy. The importance sampling ratio is defined as:

\rho_{t:T-1} = \frac{\prod_{k=t}^{T-1} \pi(A_k|S_k)p(S_{k+1}|S_k, A_k)}{\prod_{k=t}^{T-1} b(A_k|S_k)p(S_{k+1}|S_k, A_k)} = \prod_{k=t}^{T-1} \frac{\pi(A_k|S_k)}{b(A_k|S_k)}

The transition probabilities cancel, so the ratio depends only on the two policies and not on the model dynamics. Then we have \mathbb{E}[\rho_{t:T-1} G_t | S_t = s] = v_\pi(s).

There are two approaches to estimating v_\pi(s) from n-1 observed returns, where we use W_k to represent the importance sampling ratio of the k-th return for simplicity. The first is ordinary importance sampling:

V(s) = \frac{\sum_{k=1}^{n-1} W_k G_k}{n-1}

The second is weighted importance sampling:

V(s) = \frac{\sum_{k=1}^{n-1} W_k G_k}{\sum_{k=1}^{n-1} W_k}

Ordinary importance sampling is unbiased, but can have much larger or even infinite variance. Weighted importance sampling is biased, but its variance is bounded (for bounded returns) and its bias shrinks as the number of samples increases, so it is strongly preferred in practice. Moreover, writing V_n for the estimate formed from the first n-1 returns, the weighted estimator can be expressed incrementally as

V_{n+1} = V_n + \frac{W_n}{C_n} \big [ G_n - V_n \big ], where C_{n+1} = C_n + W_{n+1} and C_0 = 0.
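The following sketch compares the two estimators on a handful of made-up weights and returns, and checks that the incremental update reproduces the batch weighted estimate. All function names and the sample numbers are assumptions chosen for illustration only.

```python
def importance_ratio(target_probs, behavior_probs):
    """Product of pi(A_k|S_k) / b(A_k|S_k) along one trajectory."""
    rho = 1.0
    for p, b in zip(target_probs, behavior_probs):
        rho *= p / b
    return rho


def ordinary_is(weights, returns):
    """Ordinary importance sampling: divide by the number of returns."""
    return sum(w * g for w, g in zip(weights, returns)) / len(returns)


def weighted_is(weights, returns):
    """Weighted importance sampling: divide by the sum of the weights."""
    total = sum(weights)
    if total == 0.0:
        return 0.0  # no usable sample yet; 0.0 is an arbitrary default
    return sum(w * g for w, g in zip(weights, returns)) / total


def incremental_weighted_is(weights, returns):
    """Same estimate as weighted_is, built one return at a time."""
    v, c = 0.0, 0.0
    for w, g in zip(weights, returns):
        if w == 0.0:
            continue  # a zero weight leaves the estimate unchanged
        c += w                  # cumulative weight including this sample
        v += (w / c) * (g - v)  # move the estimate toward the new return
    return v


# Made-up data: three returns with their importance sampling ratios.
weights = [2.0, 0.0, 0.5]
returns = [1.0, 5.0, 3.0]
rho = importance_ratio([1.0, 0.5], [0.5, 0.5])  # (1.0/0.5) * (0.5/0.5)
v_ordinary = ordinary_is(weights, returns)      # 3.5 / 3
v_weighted = weighted_is(weights, returns)      # 3.5 / 2.5
v_incremental = incremental_weighted_is(weights, returns)
```

The incremental form only needs to store the current estimate and the cumulative weight, which is why it is convenient for episode-by-episode updates.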
The off-policy MC control algorithm is shown in the figure below. A problem is highlighted in its last two rows: when the target policy is deterministic, the importance sampling ratio becomes zero as soon as the behavior policy takes an action that is not greedy under the current target policy. When nongreedy actions are common, the method learns only from the tails of episodes, which greatly slows down learning.

Off-policy MC control.
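As a rough sketch of the per-episode update in off-policy MC control with weighted importance sampling, the function below processes one episode backwards and stops at the first nongreedy action, which is the tail-only learning behavior described above. It assumes a greedy deterministic target policy, and the names (`off_policy_mc_update`, `behavior_prob`) and the `(state, action, reward)` episode layout are hypothetical choices for this illustration.

```python
from collections import defaultdict


def greedy_action(q_values):
    """Index of the largest action value; ties go to the lowest index."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def off_policy_mc_update(Q, C, episode, behavior_prob, gamma=1.0):
    """Update Q and C in place from one episode generated by the behavior policy.

    episode is a list of (state, action, reward) triples, i.e.
    (S_t, A_t, R_{t+1}). behavior_prob(state, action) returns b(a|s).
    Returns how many time steps of the episode were actually used.
    """
    g = 0.0  # return accumulated from the end of the episode
    w = 1.0  # importance sampling ratio of the tail processed so far
    steps_used = 0
    for state, action, reward in reversed(episode):
        g = gamma * g + reward
        C[state][action] += w
        Q[state][action] += (w / C[state][action]) * (g - Q[state][action])
        steps_used += 1
        # The target policy is greedy with respect to Q, so pi(a|s) is 0
        # for any nongreedy action and earlier steps would get weight 0.
        if action != greedy_action(Q[state]):
            break
        w /= behavior_prob(state, action)  # pi(a|s) = 1 for the greedy action
    return steps_used


n_actions = 2
Q = defaultdict(lambda: [0.0] * n_actions)
C = defaultdict(lambda: [0.0] * n_actions)


def uniform(state, action):
    """Uniform random behavior policy over n_actions actions."""
    return 1.0 / n_actions


# Two hand-written episodes over hypothetical states "s0" and "s1".
used_first = off_policy_mc_update(Q, C, [("s0", 1, 0.0), ("s1", 0, 1.0)], uniform)
used_second = off_policy_mc_update(Q, C, [("s0", 0, 0.0), ("s1", 1, 0.0)], uniform)
```

In the second episode the final action is nongreedy in "s1", so the loop exits after one step and nothing is learned about "s0" from that episode.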

Approximations in MC

  1. Truncated policy evaluation with only a finite number of sample episodes.
  2. Policy evaluation that reuses returns from previously sampled trajectories generated under different policies.

Under these approximations, is it still wise to fully trust the estimated values to determine a greedy policy improvement?
