Chapter 2: Multi-armed Bandits
Multi-armed bandits can be seen as the simplest form of reinforcement learning, where there is only a single state. The key point here is how to estimate the action values. This chapter mainly aims to introduce some key concepts in RL, such as value estimation, exploration-exploitation dilemma and policy gradient, under a simplified setting.
Different methods for value estimation:
-
Sample average:
Pros: Easy to understand and implement; unbiased estimation; convergence guanrantee with infinite time step under stationary environment
Cons: Time and memory complexity increases with the step number -
Incremental form:
This is equivalent to sample average but introduces an important update rule in RL, i.e., However, we need to note that the here is not the true action value , but an unbiased estimation , thus will introduce noise into the update rule.
Pros: constant time and momory cost
Cons: step size decreases with time step, thus is not suitable for nonstationary environments -
Constant step size:
Pros: exponential weighted average over historical rewards, thus is suitable for nonstationary environments
Cons: biased estimation of action values; no convergence guarantee
Different methods for exploration:
- -greedy: the effect of different values of on the learning curve; time-variant scheme
-
Optimistic intial values: encourage exploration at the early stage, but not a general method
initial values as prior knowledge (MAML) -
Upper-confidence-bound (UCB):
Closely related to other methods like Bayesian optimization
Gradient Bandit Algorithms
Instead of selecting action based on the learned action values, directly learn a numerical preference for each action.
Follow the derivation in the book, which formulates it as a SGD method.
Baseline can help reduce variance.
Contextual Bandits
At each time step, the agent is presented with a random bandit from a set of different bandits, and the identity of the bandit is known.
It is different from multi-armed bandits, as different states (bandits) are involved.
It is different from RL as the action in each step only influences the immediate reward, but has no effect on the next state and the rewards in the future.