Python Package: A Plain-Language Introduction to OpenAI Gym with a Simple Hands-On Example

OpenAI Gym

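While running experiments, I noticed that some papers use OpenAI Gym to control small games, mainly to study RL algorithms, and that Gym's examples have gradually become a standard set of test cases. So this blog post briefly analyzes Gym's architecture, explains how to install and use OpenAI Gym, and ends with a simple control example.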

https://gym.openai.com/docs/ (official documentation, in English)
http://c.biancheng.net/view/1972.html (a Chinese-language blog post)

0. Introduction to Gym [this part is simple, so please read it through]

1. Environments & interface

The gym library is a collection of test problems — environments — that you can use to work out your reinforcement learning algorithms. These environments have a shared interface, allowing you to write general algorithms.

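The environment is Gym's central concept: one env is one scenario, that is, one test case. Gym is said to have more than 700 envs (I counted 859 on my installation). How is the shared interface defined? Consider the simple cart-pole example below: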

import gym
env = gym.make('CartPole-v0')  # 1. construct the env, selected by name
env.reset()                    # 2. initialize the env
for _ in range(1000):
    env.render()               # 3. render the current state
    env.step(env.action_space.sample())  # 4. take a random action
env.close()

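To see the result, open http://s3-us-west-2.amazonaws.com/rl-gym-doc/cartpole-no-reset.mp4
As the example shows, the standard interface comes down to three functions: (a) gym.make(name) constructs the env by name; (b) render() draws the scene; (c) step(action) advances the env by one step. In addition, reset() initializes the env before the loop starts.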
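To make the idea of a shared interface concrete, here is a minimal sketch of a toy environment that exposes the same calls (reset, step, render, close) in plain Python, plus a driver that works with any object following that interface. The class ToyWalkEnv, its dynamics, its action encoding, and the helper random_rollout are hypothetical illustrations and not part of Gym.

```python
import random


class ToyWalkEnv:
    """Hypothetical 1-D walk: start at 0, the episode ends at +3 or -3."""

    actions = (0, 1)  # assumed encoding: 0 = step left, 1 = step right

    def reset(self):
        self.pos = 0
        return self.pos  # initial observation

    def step(self, action):
        self.pos += 1 if action == 1 else -1
        done = abs(self.pos) >= 3
        reward = 1.0 if self.pos >= 3 else 0.0
        return self.pos, reward, done, {}  # observation, reward, done, info

    def render(self):
        # Text rendering: '*' marks the current position on a 7-cell line.
        cells = ['.'] * 7
        cells[self.pos + 3] = '*'
        print(''.join(cells))

    def close(self):
        pass


def random_rollout(env, max_steps=1000, seed=0):
    """Drive any env that follows the interface above with random actions."""
    rng = random.Random(seed)
    obs = env.reset()
    steps = 0
    for _ in range(max_steps):
        obs, reward, done, info = env.step(rng.choice(env.actions))
        steps += 1
        if done:
            break
    env.close()
    return obs, steps


final_obs, n_steps = random_rollout(ToyWalkEnv())
```

Because random_rollout only relies on reset, step, and close, the same function could drive a different env without changes, which is the point of a shared interface.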

2. env_name

gym.make(env_name)

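Gym has many envs, so how do you pick one? The official site explains this. The env_name values are registered in https://github.com/openai/gym/blob/master/gym/envs/__init__.py; open that file to see every env_name. You can also list all available envs programmatically: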

from gym import envs
print(envs.registry.all())
print(len(envs.registry.all()))  # 859 on my installation

Below are the env_names I used, together with their registration entries.

# Classic
# ----------------------------------------

register(
    id='CartPole-v0',
    entry_point='gym.envs.classic_control:CartPoleEnv',
    max_episode_steps=200,
    reward_threshold=195.0,
)

register(
    id='CartPole-v1',
    entry_point='gym.envs.classic_control:CartPoleEnv',
    max_episode_steps=500,
    reward_threshold=475.0,
)

register(
    id='Pendulum-v0',
    entry_point='gym.envs.classic_control:PendulumEnv',
    max_episode_steps=200,
)

register(
    id='Acrobot-v1',
    entry_point='gym.envs.classic_control:AcrobotEnv',
    reward_threshold=-100.0,
    max_episode_steps=500,
)

3. Observations

Once the environment (env) has been constructed, observing it is the first step of control or RL. The observation is returned by the step function.

observation, reward, done, info = env.step(env.action_space.sample())

observation (object): an environment-specific object representing your observation of the environment. For example, pixel data from a camera, joint angles and joint velocities of a robot, or the board state in a board game.

reward (float): amount of reward achieved by the previous action. The scale varies between environments, but the goal is always to increase your total reward.

done (boolean): whether it’s time to reset the environment again. Most (but not all) tasks are divided up into well-defined episodes, and done being True indicates the episode has terminated. (For example, perhaps the pole tipped too far, or you lost your last life.)

info (dict): diagnostic information useful for debugging. It can sometimes be useful for learning (for example, it might contain the raw probabilities behind the environment’s last state change). However, official evaluations of your agent are not allowed to use this for learning.

Python code that makes use of the observation looks like this:

import gym
env = gym.make('CartPole-v0')
for i_episode in range(20):
    observation = env.reset()  # reset returns the initial observation
    for t in range(100):
        env.render()
        print(observation)
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)  # step returns the next observation
        if done:
            print("Episode finished after {} timesteps".format(t+1))
            break
env.close()
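The loop above prints observations but discards the reward. A common next step is to accumulate the reward per episode. The sketch below does this for any env that returns the 4-tuple described above. The names run_episode and CountdownEnv are hypothetical; the stub env exists only so that the snippet runs without Gym installed.

```python
class CountdownEnv:
    """Stub env: the observation counts down from 5; each step gives reward 1.0."""

    def reset(self):
        self.remaining = 5
        return self.remaining

    def step(self, action):
        self.remaining -= 1
        done = self.remaining == 0
        return self.remaining, 1.0, done, {}


def run_episode(env, policy, max_steps=100):
    """Run one episode and return (total_reward, steps_taken)."""
    observation = env.reset()
    total_reward = 0.0
    for t in range(max_steps):
        action = policy(observation)
        observation, reward, done, info = env.step(action)
        total_reward += reward
        if done:
            return total_reward, t + 1
    return total_reward, max_steps


# The stub ignores the action, so a constant policy is enough here.
total, steps = run_episode(CountdownEnv(), policy=lambda obs: 0)
print(total, steps)
```

Passing the policy in as a function keeps the loop independent of how actions are chosen, so a random policy and a learned one can share the same code.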

4. Action space

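The space of control inputs is determined by the simulated system, so the env defines the type and number of control inputs. In RL, a policy picks an action a from the action space A. Let's look at the spaces first, both the action space and the observation space.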

import gym
env = gym.make('CartPole-v0')
print(env.action_space)
#> Discrete(2)  # discrete type, {0, 1}; discrete actions are numbered from 0
print(env.observation_space)
#> Box(4,)      # continuous; each of the 4 values lies in its own closed interval (bounds shown below)
print(env.observation_space.high)
#> array([ 2.4       ,         inf,  0.20943951,         inf])
print(env.observation_space.low)
#> array([-2.4       ,        -inf, -0.20943951,        -inf])

Now look back at the step call in the earlier code: it randomly samples an action from the action space.

action = env.action_space.sample()  # aha: pick a random action a from A
observation, reward, done, info = env.step(action)
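To illustrate what Discrete(2) and Box(4,) stand for, here is a minimal pure-Python sketch of two space classes with sample() and contains() methods. ToyDiscrete and ToyBox are hypothetical stand-ins, not Gym's classes. The finite velocity bounds of +/-10 are an assumption made so that uniform sampling is well defined; the CartPole printout above shows infinite bounds for those entries.

```python
import random


class ToyDiscrete:
    """The integers {0, 1, ..., n-1}."""

    def __init__(self, n):
        self.n = n

    def sample(self):
        return random.randrange(self.n)

    def contains(self, x):
        return isinstance(x, int) and 0 <= x < self.n


class ToyBox:
    """Vectors whose i-th entry lies in the closed interval [low[i], high[i]]."""

    def __init__(self, low, high):
        assert len(low) == len(high)
        self.low, self.high = list(low), list(high)

    def sample(self):
        return [random.uniform(lo, hi) for lo, hi in zip(self.low, self.high)]

    def contains(self, x):
        return len(x) == len(self.low) and all(
            lo <= xi <= hi for lo, xi, hi in zip(self.low, x, self.high)
        )


action_space = ToyDiscrete(2)
# Position and angle bounds follow the printout above; +/-10 for velocities is assumed.
observation_space = ToyBox(low=[-2.4, -10.0, -0.20943951, -10.0],
                           high=[2.4, 10.0, 0.20943951, 10.0])

a = action_space.sample()
obs = observation_space.sample()
```

A random policy is then just action_space.sample(), and contains() gives a cheap sanity check that an action or observation is valid for the env.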

5. Render

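Rendering simply means displaying the state of the env. Depending on the application there are two cases: output meant for people and output meant for programs. The render function is clearly defined and takes a mode parameter, which is worth setting to match your use case: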

    def render(self, mode='human'):
        """Renders the environment.

        The set of supported modes varies per environment. (And some
        environments do not support rendering at all.) By convention,
        if mode is:

        - human: render to the current display or terminal and
          return nothing. Usually for human consumption.
        - rgb_array: Return an numpy.ndarray with shape (x, y, 3),
          representing RGB values for an x-by-y pixel image, suitable
          for turning into a video.
        - ansi: Return a string (str) or StringIO.StringIO containing a
          terminal-style text representation. The text can include newlines
          and ANSI escape sequences (e.g. for colors).

        Note:
            Make sure that your class's metadata 'render.modes' key includes
              the list of supported modes. It's recommended to call super()
              in implementations to use the functionality of this method.

        Args:
            mode (str): the mode to render with"""
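As a sketch of the convention in this docstring, the toy class below supports the 'human' and 'ansi' modes and rejects anything else. GridToy is hypothetical and does not subclass gym.Env; it only mirrors the described return-value convention (nothing for human, a string for ansi).

```python
class GridToy:
    """Hypothetical env-like object: a marker 'A' on a 1-D strip of cells."""

    metadata = {'render.modes': ['human', 'ansi']}

    def __init__(self, size=5, pos=2):
        self.size, self.pos = size, pos

    def render(self, mode='human'):
        if mode not in self.metadata['render.modes']:
            raise ValueError(f"Unsupported render mode: {mode}")
        text = ''.join('A' if i == self.pos else '.' for i in range(self.size))
        if mode == 'ansi':
            return text  # for programs: return the text representation
        print(text)      # for people: draw to the terminal and return nothing
        return None


env = GridToy()
frame = env.render(mode='ansi')
```

Returning data in 'ansi' mode (or an array in 'rgb_array' mode) is what lets a program record frames, while 'human' mode is only for watching.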

1. Installation

1. Install with pip

pip install gym

2. Install from source

git clone https://github.com/openai/gym
cd gym
pip install -e .
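After either installation route, a quick way to confirm that Python can find the package is to look it up without importing it. This is a small sketch using only the standard library; the helper name is_installed is hypothetical.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the named top-level package can be found by the import system."""
    return importlib.util.find_spec(package_name) is not None


print("gym installed:", is_installed("gym"))
```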

2. Usage

import gym
# As for the rest, a smart reader like you has surely figured it out already.
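As one possible simple control example for CartPole, the sketch below uses a hand-written rule instead of a learned policy: push the cart toward the side the pole is leaning. It assumes the CartPole conventions that observation[2] is the pole angle (positive when leaning right) and that action 1 pushes right while action 0 pushes left, and it assumes the 4-tuple step API used throughout this post. The names lean_policy and run_cartpole are hypothetical; run_cartpole is only defined here, not called, so the snippet runs even without Gym installed.

```python
def lean_policy(observation):
    """Return 1 (push right) if the pole leans right, else 0 (push left)."""
    pole_angle = observation[2]
    return 1 if pole_angle > 0 else 0


def run_cartpole(env, max_steps=200):
    """Run one episode with lean_policy; return the number of steps survived."""
    observation = env.reset()
    for t in range(max_steps):
        action = lean_policy(observation)
        observation, reward, done, info = env.step(action)
        if done:
            return t + 1
    return max_steps
```

With Gym installed, this would be called as run_cartpole(gym.make('CartPole-v0')). The rule ignores velocities, so it should be read as a baseline to compare an RL agent against, not as a good controller.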

3. Refs

If you found this post useful and not misleading, please give it a like so that more people can find it.