Automatically Writing Classical Chinese Poetry with an RNN+LSTM Model in Keras

These are my notes on implementing an LSTM model in Keras that writes classical Chinese poetry.
Code: GitHub

Introduction

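Recurrent neural networks (RNNs) are among the most capable neural network models for sequence data, and are now widely used in speech recognition, text classification, and other natural language processing tasks.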

Many deep learning frameworks make it easy to implement an RNN model. I prefer Keras, so all the code in this post is written with Keras.

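There are of course plenty of examples online built with other frameworks, and this post borrows from some of that earlier code. While learning, though, I ran into a few problems that other articles did not explain clearly, so I am recording them here.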

Model

model.png

The model is simple: two LSTM+Dropout blocks, followed by a fully connected layer with a softmax activation.

Corpus

poetry.png

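The corpus was downloaded from the internet. I did not note the source at the time, so I can only quietly thank whoever did the preparation work.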

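It contains a little over forty thousand classical poems, one per line, with a colon separating the title from the verses. The model learns only from the verses, so the title is removed during preprocessing.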
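
As a minimal sketch of that title-stripping step, the helper below keeps only the part of a line after the separator. The function name strip_title, the ASCII ":" separator, and the sample line are assumptions made for illustration; the real corpus file may differ.

```python
def strip_title(line, sep=":"):
    """Return only the verses of a 'title:verses' line.

    The separator is an assumption; adjust it to match the corpus.
    """
    # Split on the first separator only, so a separator inside the
    # verses does not cut them short.
    parts = line.strip().split(sep, 1)
    # If the line has no separator, the whole line is returned.
    return parts[-1]


# Hypothetical sample line in the "title:verses" layout described above.
sample = "静夜思:床前明月光，疑是地上霜。\n"
print(strip_title(sample))
```

Splitting on the first separator only is a design choice for this sketch; it keeps everything after the title intact.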

Preprocessing the File

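First, a machine has no idea what a Chinese character means, so the text has to be converted into a form it can work with. Here we use one-hot encoding (look it up for the details). In short, all the characters are collected into a dictionary, and each character is identified by its index in that dictionary. Take “我爱吃香蕉”: it has five characters, so the dictionary is ["我", "爱", "吃", "香", "蕉"], and "我" is represented by [1,0,0,0,0]. Encoded this way, “香蕉” becomes a matrix of shape (2, 5).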
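
The five-character example can be reproduced with a few lines of plain Python. This is only an illustrative sketch; build_vocab and one_hot are hypothetical helper names, not functions from the project.

```python
def build_vocab(text):
    """Map each distinct character to an index, in order of first appearance."""
    vocab = {}
    for ch in text:
        if ch not in vocab:
            vocab[ch] = len(vocab)
    return vocab


def one_hot(text, vocab):
    """Encode text as a list of rows, one one-hot row per character."""
    rows = []
    for ch in text:
        row = [0] * len(vocab)
        row[vocab[ch]] = 1
        rows.append(row)
    return rows


vocab = build_vocab("我爱吃香蕉")
encoded = one_hot("香蕉", vocab)
print(encoded)  # [[0, 0, 0, 1, 0], [0, 0, 0, 0, 1]]
```

The result has two rows of five entries each, which is the (2, 5) shape mentioned in the text.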

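In the same way, preprocessing the poetry file means collecting every character into a dictionary, so that each character in a verse can be represented as a vector.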

def preprocess_file(Config):
    # Full text of the corpus
    files_content = ''
    with open(Config.poetry_file, 'r', encoding='utf-8') as f:
        for line in f:
            # Drop the title (the part before ":"), then append "]"
            # to mark the end of the poem
            files_content += line.strip().split(":")[-1] + "]"

    # Count how often each character occurs
    words = sorted(list(files_content))
    counted_words = {}
    for word in words:
        if word in counted_words:
            counted_words[word] += 1
        else:
            counted_words[word] = 1

    # Remove low-frequency characters (2 occurrences or fewer)
    erase = []
    for key in counted_words:
        if counted_words[key] <= 2:
            erase.append(key)
    for key in erase:
        del counted_words[key]
    # Sort by descending frequency
    wordPairs = sorted(counted_words.items(), key=lambda x: -x[1])

    words, _ = zip(*wordPairs)
    # The trailing " " stands in for characters missing from the dictionary
    words += (" ",)
    # Mappings between characters and ids
    word2num = dict((c, i) for i, c in enumerate(words))
    num2word = dict((i, c) for i, c in enumerate(words))
    # Unknown characters map to the id of the trailing " "
    word2numF = lambda x: word2num.get(x, len(words) - 1)
    return word2numF, num2word, words, files_content

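The ] appended to each line marks the end of a poem. The model is trained to produce the seventh character given the previous six, so the training data is generated later by sliding a window of span 6 over the text with a stride of 1. For example, with a span of 3, “我要吃香蕉” yields ("我要吃", "香") and ("要吃香", "蕉"). Within one window, the characters are related to each other. If a window contains ], the text before and after it belongs to two different poems that have nothing to do with each other, so training samples containing ] are discarded.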
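
The windowing rule can be sketched in plain Python as follows. The helper name make_windows and the two-poem sample string are assumptions for illustration only; a span of 3 is used so the output stays short.

```python
def make_windows(text, span, end_mark="]"):
    """Yield (context, next_char) pairs with stride 1.

    Pairs whose context or target contains end_mark are skipped,
    because they would straddle two different poems.
    """
    for i in range(len(text) - span):
        x = text[i:i + span]
        y = text[i + span]
        if end_mark in x or y == end_mark:
            continue
        yield x, y


# Two short "poems", each terminated by "]" (hypothetical sample text).
pairs = list(make_windows("我要吃香蕉]床前明月光]", span=3))
for x, y in pairs:
    print(x, "->", y)
```

On this sample only four pairs survive: two from each poem. Every window that touches ] is dropped.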

Generating the Training Data

Before generating the training data, let's look at the configuration:

class Config(object):
    # Corpus file
    poetry_file = 'poetry.txt'
    # File name for the saved model
    weight_file = 'poetry_model.h5'
    # Window span: number of input characters per sample
    max_len = 6
    # batch_size
    batch_size = 32
    # learning_rate
    learning_rate = 0.001
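The configuration is simple: mainly the path to the corpus, the file name for the saved model, and so on.
Now we generate the training data: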

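    # Assumes `import numpy as np` at module level.
    def data_generator(self):
        '''Generate training samples one at a time'''
        i = 0
        while 1:
            # Input: max_len consecutive characters; target: the next one
            x = self.files_content[i: i + self.config.max_len]
            y = self.files_content[i + self.config.max_len]

            # Skip windows that cross a poem boundary
            if ']' in x or ']' in y:
                i += 1
                continue

            # One-hot target, shape (1, vocabulary size).
            # The builtin bool replaces the removed np.bool alias.
            y_vec = np.zeros(
                shape=(1, len(self.words)),
                dtype=bool
            )
            y_vec[0, self.word2numF(y)] = 1.0

            # One-hot input, shape (1, max_len, vocabulary size)
            x_vec = np.zeros(
                shape=(1, self.config.max_len, len(self.words)),
                dtype=bool
            )

            for t, char in enumerate(x):
                x_vec[0, t, self.word2numF(char)] = 1.0

            yield x_vec, y_vec
            i += 1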

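x is the input and y is the output: the input is the previous six characters and the output is the seventh. Both are then converted to one-hot vectors, and each value the generator yields is a batch containing a single sample.
Note that the generator is an infinite while 1 loop. The official Keras documentation describes what a generator should produce as follows: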

a tuple (inputs, targets, sample_weights). This tuple (a single output of the generator) makes a single batch. Therefore, all arrays in this tuple must have the same length (equal to the size of this batch). Different batches may have different sizes. For example, the last batch of the epoch is commonly smaller than the others, if the size of the dataset is not divisible by the batch size. The generator is expected to loop over its data indefinitely. An epoch finishes when steps_per_epoch batches have been seen by the model.

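In practice, though, our generator cannot loop forever: once i + max_len >= len(files_content), the index runs past the end of the corpus. For that reason we limit the number of training steps later on.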
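
One way to see how many samples are available before the index runs out is to count them up front. The sketch below is not part of the original code; count_steps is a hypothetical helper that scans the text the same way the generator does.

```python
def count_steps(text, span, end_mark="]"):
    """Count the samples a stride-1 scan can produce before text runs out."""
    total = 0
    # text[i + span] must exist, so the last valid i is len(text) - span - 1.
    for i in range(len(text) - span):
        window = text[i:i + span + 1]  # context plus target character
        if end_mark not in window:
            total += 1
    return total


# Hypothetical sample text: two short poems, span of 3.
print(count_steps("我要吃香蕉]床前明月光]", span=3))  # 4
```

Keeping the total number of generator calls at or below such a count is one way to stay inside the text; the training code in this post uses a different, simpler bound.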

Building the Model

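    # Assumes at module level:
    #   from keras.layers import Input, LSTM, Dropout, Dense
    #   from keras.models import Model
    #   from keras.optimizers import Adam
    def build_model(self):
        '''Build the model'''

        # Input shape: (max_len, vocabulary size)
        input_tensor = Input(shape=(self.config.max_len, len(self.words)))
        # First LSTM returns the full sequence so the second LSTM can consume it
        lstm = LSTM(512, return_sequences=True)(input_tensor)
        dropout = Dropout(0.6)(lstm)
        # Second LSTM returns only its final output
        lstm = LSTM(256)(dropout)
        dropout = Dropout(0.6)(lstm)
        # Probability distribution over the vocabulary
        dense = Dense(len(self.words), activation='softmax')(dropout)
        self.model = Model(inputs=input_tensor, outputs=dense)
        # Newer Keras versions name this argument learning_rate instead of lr
        optimizer = Adam(lr=self.config.learning_rate)
        self.model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])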

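There is nothing difficult here, because Keras wraps the various layers very well and we only need to fill in the parameters. One suggestion: when building your own network, work out the shape of the data at every layer by hand first, especially the input shape, since that makes errors much easier to track down. One pitfall worth mentioning: Keras distinguishes two ways of ordering dimensions, channels_last and channels_first, which are the conventions used by TensorFlow and Theano respectively.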
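
As an illustration of working the shapes out by hand, the sketch below lists the output shape and trainable parameter count of each layer in this model. The vocabulary size of 5000 is a made-up value for illustration, and the helper names are hypothetical; the LSTM count uses the standard formula of four gates, each with input weights, recurrent weights, and a bias.

```python
def lstm_params(input_dim, units):
    """Trainable parameters of a standard LSTM layer with biases."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """Weights plus biases of a fully connected layer."""
    return input_dim * units + units


max_len = 6
vocab_size = 5000  # hypothetical vocabulary size, for illustration only

# (layer, output shape, parameters); None is the batch dimension.
layers = [
    ("input", (None, max_len, vocab_size), 0),
    ("lstm_1 (return_sequences=True)", (None, max_len, 512), lstm_params(vocab_size, 512)),
    ("dropout_1", (None, max_len, 512), 0),
    ("lstm_2", (None, 256), lstm_params(512, 256)),
    ("dropout_2", (None, 256), 0),
    ("dense (softmax)", (None, vocab_size), dense_params(256, vocab_size)),
]
for name, shape, params in layers:
    print(f"{name:32s} {str(shape):22s} {params}")
```

Comparing a hand-made table like this with the output of model.summary() is a quick way to catch a wrong input shape.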

Training the Model

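    # Assumes at module level:
    #   import keras
    #   from keras.callbacks import LambdaCallback
    def train(self):
        '''Train the model'''
        number_of_epoch = len(self.words) // self.config.batch_size

        if not self.model:
            self.build_model()

        # fit_generator is deprecated or removed in newer Keras releases,
        # where fit() accepts a generator directly
        self.model.fit_generator(
            generator=self.data_generator(),
            verbose=True,
            steps_per_epoch=self.config.batch_size,
            epochs=number_of_epoch,
            callbacks=[
                # Save the model at the end of every epoch
                keras.callbacks.ModelCheckpoint(self.config.weight_file, save_weights_only=False),
                # Print a sample verse at the end of every epoch
                LambdaCallback(on_epoch_end=self.generate_sample_result)
            ]
        )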

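steps_per_epoch is the number of batches drawn from the generator in one epoch.
number_of_epoch is the number of epochs to train.
As mentioned above, the index i + max_len must not run past the end of the text, so the total number of generator calls (steps_per_epoch × number_of_epoch) has to be bounded. With steps_per_epoch set to batch_size, number_of_epoch is obtained by dividing by batch_size. Note that the code above divides len(self.words), the size of the dictionary, rather than the length of the text. Since the dictionary is much smaller than the corpus, the generator stays well inside the text, but only part of the corpus is used for training.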
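
The arithmetic behind this bound can be checked with a small sketch. training_budget is a hypothetical helper, and 5000 is a made-up count used only to show the numbers.

```python
def training_budget(num_items, batch_size):
    """Return (epochs, steps_per_epoch, total generator calls).

    Mirrors the scheme above: steps_per_epoch is batch_size and the
    number of epochs is num_items // batch_size.
    """
    epochs = num_items // batch_size
    steps_per_epoch = batch_size
    return epochs, steps_per_epoch, epochs * steps_per_epoch


epochs, steps, total = training_budget(num_items=5000, batch_size=32)
print(epochs, steps, total)  # 156 32 4992
```

Because of the integer division, the total number of generator calls never exceeds num_items.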

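The callbacks include one that saves the model and one that generates a sample verse at the end of every epoch, so during training you can watch the generated verses become more reasonable as the model sees more training data.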
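
The post does not show generate_sample_result, so the following is only a rough sketch of how character-by-character generation might work, not the author's implementation. The names generate and dummy_predict are hypothetical, and dummy_predict is a stand-in for the trained model: a real version would one-hot encode the context, call model.predict, and choose a character from the softmax output.

```python
import random


def generate(seed, predict_next, length, span=6):
    """Extend seed one character at a time.

    predict_next maps the last `span` characters to the next character.
    """
    text = seed
    for _ in range(length):
        # Slide the window: only the most recent `span` characters are used.
        text += predict_next(text[-span:])
    return text


def dummy_predict(context, alphabet="春花秋月，。"):
    """Stand-in for the trained model: ignores context, picks at random."""
    return random.choice(alphabet)


print(generate("床前明月光，", dummy_predict, length=10))
```

The loop uses the same span as training: each new character is predicted from the six characters before it, then becomes part of the context for the next step.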

Results

Verses generated early in training:

Verses generated at the end of training:

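The verses written at the end of training still do not make much sense, but the model clearly improves as it sees more data: at first it cannot even place punctuation correctly, and by the end it writes lines that look a little like poetry.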