首先我想说下为什么会去学习cs224d,原先我一直是做工程的,做了大概3年,产品做了好多,但是大多不幸夭折了,上线没多久就下线,最后实在是经受不住心灵的折磨,转行想做大数据,机器学习的,前不久自己学习完了Udacity的深度学习,课程挺好,但是在实际工作中,发现课程中的数据都是给你准备好的,实践中哪来这么多好的数据,只能自己去通过各种手段搞数据,苦不堪言。在找数据的过程中,发现做多的数据还是文本数据,不懂个nlp怎么处理呢,于是就来学习cs224d这门课程,希望在学习过程中能快速将课程所学应用到工作中,fighting!
以下文字都是从一个初学者角度来写的,如有不严肃不正确的,请严肃指出。
语言模型
刚接触nlp的时候,经常看到有个词叫 language model,于是先回答下什么语言模型?语言模型简单点说就是评价一句话是不是正常人说出来的,然后如果用一个数学公式来描述就是:
举一个具体例子来说明上面公式的含义:
我喜欢自然语言处理,这句话分词后是:"我/喜欢/自然/语言/处理",于是上面的公式就变为:
P(我,喜欢,自然,语言,处理)=p(我)p(喜欢 | 我)p(自然 | 我, 喜欢)p(语言 | 我,喜欢,自然)p(处理 | 我,喜欢,自然,语言)
上面的p(喜欢 | 我)表示”喜欢“出现在”我“之后的概率,然后将所有概率乘起来就是整句话出现的概率。
我们对上面的公式做一个更一般化的表示,将wt出现的前面一大堆条件统一记为Context
,于是就有了下面的公式:
下面我们针对Context的具体形式介绍一种常见的语言模型N-Gram.
N-Gram
上面的介绍的公式考虑到词和词之间的依赖关系,但是比较复杂,对于4个词的语料库,我们就需要计算 4!+3!+2!+1! 个情况,稍微大点,在实际生活中几乎没办法使用,于是我们就想了很多办法去近似这个公式,于是就有了下面要介绍的N-Gram模型。
上面的 context 都是这句话中这个词前面的所有词作为条件的概率,N-gram 就是只管这个词前面的 n-1 个词,加上它自己,总共 n 个词,于是上面的公式就简化为:
当n=1时,单词只和自己有关,模型称为一元模型,n=2时就是bigram,n=3,trigram,据统计在英文语料库IBM, Brown中,三四百兆的语料,其测试语料14.7%的trigram和2.2%的bigram在训练语料中竟未出现!因此在实际中我们应该根据实际情况正确的选择n,如果n很大,参数空间过大,也无法实用。
- 实际中,最多就用到trigram,再多计算量变大,但是效果提升不明显
- 更大的n,理论上能提供的信息会更多,具有更多的信息
- 更小的n,在语料中出现的次数会更多,具有更多的统计信息
word2vec
前面我们已经有了计算概率公式p(w_i|Context_i)了,对于传统的统计方法,我们可以事先在语料上将所有可能的组合都计算出来,那有没有什么数学方法,能不通过语料统计方法,直接将p(w_i|Context_i)计算出来呢?再说的数学点,通过什么方法能够将其拟合出来,数学上的描述就是:
具体拟合的方法有各种各样,其中一个比较厉害的就是通过神经网络来拟合,也就是本文要介绍的word2vec。
word2vec的理论部分,网上已经有很好的资料,推荐 word2vec 中的数学原理详解(一)目录和前言,我主要会以具体的实现为主,有喜欢看视频的同学也可以看Udacity 课程视频。
word2vec尝试着将词都映射到一个高维空间,每个词都可以用一个稠密向量来表示,而这个词向量怎么计算出来,采用的方法是一种无监督方法,假设是词的含义由其周围的词来表示:相似的词,会有相似的上下文。
在具体计算词向量的时候,有两种模型:Skip-Gram 和 CBOW,
我们先介绍skip-gram的原理,其训练过程是:
把词cat
放进 Embeddings 向量空间,然后做一次线性计算,然后取 softmax,得到一个一批 0-1 的数值,然后 cross_entropy,产出预测词purr
。跟目标比对,然后调整。这就是训练过程。如下图:
可能还是有点抽象,我们接着看代码来详细说明上面的过程,代码地址:5_word2vec.ipynb
生成Skip-Gram数据
Skip-Gram思想是通过单词来预测其周围的单词,此时,如果输入数据是:data: ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first'],那输出是:
with num_skips = 2 and skip_window = 1:
batch: ['originated', 'originated', 'as', 'as', 'a', 'a', 'term', 'term']
labels: ['as', 'anarchism', 'a', 'originated', 'term', 'as', 'a', 'of']
%matplotlib inline
import collections
import math
import numpy as np
import tensorflow as tf
import os
import random
import zipfile
from matplotlib import pylab
from six.moves import range
from six.moves.urllib.request import urlretrieve
from sklearn.manifold import TSNE
import jieba
import re
import sklearn
import multiprocessing
import seaborn as sns
from pyltp import SentenceSplitter
import pandas as pd
# sents = SentenceSplitter.split('元芳你怎么看?我就趴窗口上看呗!')
content = ""
with open("../input/人民的名义.txt",'r') as f:
for line in f:
line = line.strip("\n")
content += line
stop_words = []
with open("../input/stop-words.txt") as f:
for word in f:
stop_words.append(word.strip("\n"))
# 进行分词
accepted_chars = re.compile(r"[\u4E00-\u9FD5]+")
# 只取纯中文的字符,并且不在冲用词之中的
tokens = [ token for token in jieba.cut(content) if accepted_chars.match(token) and token not in stop_words]
# tokens = [ token for token in jieba.cut(content) if accepted_chars.match(token)]
count = collections.Counter(tokens)
len(count)
17136
vocabulary_size = 9000 # 总共18000左右的单词
# 由于语料库太小,所以大多数词只出现了一次,我们应该过滤掉出现次数少的单词
def build_dataset(words):
count = [['UNK', -1]]
count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
dictionary = dict()
for word, _ in count:
dictionary[word] = len(dictionary)
data = list()
unk_count = 0
for word in words:
if word in dictionary:
index = dictionary[word]
else:
index = 0 # dictionary['UNK']
unk_count = unk_count + 1
data.append(index)
count[0][1] = unk_count
reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
return data, count, dictionary, reverse_dictionary
data, count, dictionary, reverse_dictionary = build_dataset(tokens)
print('Most common words (+UNK)', count[:5])
print('Sample data', data[:10])
Most common words (+UNK) [['UNK', 8137], ('说', 1326), ('侯亮', 1315), ('平', 966), ('李达康', 735)]
Sample data [7540, 7541, 7542, 885, 7543, 3470, 7544, 4806, 207, 4807]
data_index = 0
def generate_batch(batch_size, num_skips, skip_window):
global data_index
assert batch_size % num_skips == 0
assert num_skips <= 2 * skip_window
batch = np.ndarray(shape=(batch_size), dtype=np.int32)
labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
span = 2 * skip_window + 1 # [ skip_window target skip_window ]
buffer = collections.deque(maxlen=span)
# 根据global data_index 产生数据
for _ in range(span):
buffer.append(data[data_index])
data_index = (data_index + 1) % len(data)
for i in range(batch_size // num_skips):
target = skip_window # target label at the center of the buffer
targets_to_avoid = [ skip_window ]
for j in range(num_skips):
while target in targets_to_avoid:
target = random.randint(0, span - 1)
targets_to_avoid.append(target)
batch[i * num_skips + j] = buffer[skip_window]
labels[i * num_skips + j, 0] = buffer[target]
buffer.append(data[data_index]) # 此处 buffer 是 deque ,容量是 span ,所以每次新增数据进来,都会挤掉之前的数
data_index = (data_index + 1) % len(data)
return batch, labels
print('data:', [reverse_dictionary[di] for di in data[:8]])
for num_skips, skip_window in [(2, 1), (4, 2)]:
data_index = 0
batch, labels = generate_batch(batch_size=8, num_skips=num_skips, skip_window=skip_window)
print('\nwith num_skips = %d and skip_window = %d:' % (num_skips, skip_window))
print(' batch:', [reverse_dictionary[bi] for bi in batch])
print(' labels:', [reverse_dictionary[li] for li in labels.reshape(8)])
data: ['书籍', '稻草人', '书屋', '名义', '作者', '周梅森', '内容简介', '展现']
with num_skips = 2 and skip_window = 1:
batch: ['稻草人', '稻草人', '书屋', '书屋', '名义', '名义', '作者', '作者']
labels: ['书籍', '书屋', '稻草人', '名义', '书屋', '作者', '周梅森', '名义']
with num_skips = 4 and skip_window = 2:
batch: ['书屋', '书屋', '书屋', '书屋', '名义', '名义', '名义', '名义']
labels: ['稻草人', '名义', '书籍', '作者', '周梅森', '稻草人', '书屋', '作者']
batch_size = 128
embedding_size = 50 # Dimension of the embedding vector.
skip_window = 1 # How many words to consider left and right.
num_skips = 2 # How many times to reuse an input to generate a label.
# We pick a random validation set to sample nearest neighbors. here we limit the
# validation samples to the words that have a low numeric ID, which by
# construction are also the most frequent.
valid_size = 8 # Random set of words to evaluate similarity on.
valid_window = 100 # Only pick dev samples in the head of the distribution.
valid_examples = np.array(random.sample(range(valid_window), valid_size))
num_sampled = 64 # Number of negative examples to sample.
graph = tf.Graph()
with graph.as_default(), tf.device('/cpu:0'):
# Input data.
train_dataset = tf.placeholder(tf.int32, shape=[batch_size])
train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
# Variables.
embeddings = tf.Variable(
tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
softmax_weights = tf.Variable(
tf.truncated_normal([vocabulary_size, embedding_size],
stddev=1.0 / math.sqrt(embedding_size)))
softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))
# Model.
# Look up embeddings for inputs.
embed = tf.nn.embedding_lookup(embeddings, train_dataset)
# Compute the softmax loss, using a sample of the negative labels each time.
loss = tf.reduce_mean(
tf.nn.sampled_softmax_loss(weights=softmax_weights, biases=softmax_biases, inputs=embed,
labels=train_labels, num_sampled=num_sampled, num_classes=vocabulary_size))
# Optimizer.
# Note: The optimizer will optimize the softmax_weights AND the embeddings.
# This is because the embeddings are defined as a variable quantity and the
# optimizer's `minimize` method will by default modify all variable quantities
# that contribute to the tensor it is passed.
# See docs on `tf.train.Optimizer.minimize()` for more details.
optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)
# Compute the similarity between minibatch examples and all embeddings.
# We use the cosine distance:
norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
normalized_embeddings = embeddings / norm
valid_embeddings = tf.nn.embedding_lookup(
normalized_embeddings, valid_dataset)
similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings))
上面定义了模型,下面是具体的优化步骤
num_steps = 100001
with tf.Session(graph=graph) as session:
tf.global_variables_initializer().run()
print('Initialized')
average_loss = 0
for step in range(num_steps):
batch_data, batch_labels = generate_batch(
batch_size, num_skips, skip_window)
feed_dict = {train_dataset : batch_data, train_labels : batch_labels}
_, l = session.run([optimizer, loss], feed_dict=feed_dict)
average_loss += l
if step % 2000 == 0:
if step > 0:
average_loss = average_loss / 2000
# The average loss is an estimate of the loss over the last 2000 batches.
print('Average loss at step %d: %f' % (step, average_loss))
average_loss = 0
# note that this is expensive (~20% slowdown if computed every 500 steps)
if step % 10000 == 0:
sim = similarity.eval()
for i in range(valid_size):
valid_word = reverse_dictionary[valid_examples[i]]
top_k = 8 # number of nearest neighbors
nearest = (-sim[i, :]).argsort()[1:top_k+1]
log = 'Nearest to %s:' % valid_word
for k in range(top_k):
close_word = reverse_dictionary[nearest[k]]
log = '%s %s,' % (log, close_word)
print(log)
final_embeddings = normalized_embeddings.eval()
Initialized
Average loss at step 0: 4.637629
Nearest to 书记: 一整瓶, 倒好, 一把, 袅袅, 不到, 三姐, 不低, 坦承,
Nearest to 局长: 递到, 报纸, 国旗, 网上, 对此, 吼, 既有, 幸亏,
Nearest to 问: 几遍, 就业, 商贩, 口水, 亲近, 周, 手枪, 庄严,
Nearest to 季: 屈来, 天花板, 韩剧, 栽倒, 判人, 审视, 有意, 内容简介,
Nearest to 工作: 阻力, 亲密, 大人物, 外柔内刚, 准, 万元, 讽刺, 草根,
Nearest to 陈清泉: 菏泽, 梁山, 铁板一块, 前脚, 运来, 酒量, 四十万, 黑色,
Nearest to 省委: 市委书记, 酒量, 交代问题, 外号, 漏洞, 虚报, 那轮, 走到,
Nearest to 太: 早已, 纹, 推开, 党, 命运, 始终, 情景, 处处长,
Average loss at step 2000: 3.955496
Average loss at step 4000: 3.414563
Average loss at step 6000: 3.149207
Average loss at step 8000: 2.955826
Average loss at step 10000: 2.786807
Nearest to 书记: 市长, 老道, 建设, 大鬼, 咬住, 坦承, 调过来, 特立独行,
Nearest to 局长: 座位, 跳脚, 杀人, 第二天, 顶天立地, 报纸, 眼眶, 背书,
Nearest to 问: 凄凉, 料到, 点头称是, 空降, 弥漫, 装, 猜, 几遍,
Nearest to 季: 韩剧, 愉快, 天花板, 民不聊生, 提供, 亲, 那双, 金矿,
Nearest to 工作: 代价, 外柔内刚, 珍藏, 提醒, 大人物, 不小, 讽刺, 激化矛盾,
Nearest to 陈清泉: 没能, 鹰, 威风凛凛, 鸡蛋, 冤枉, 四十万, 大鬼, 木板,
Nearest to 省委: 代表, 三级, 影像, 敏锐, 铸下, 研究, 板, 暗星,
Nearest to 太: 印记, 推开, 抖动, 托盘, 不放过, 虚假, 刻下, 稻草人,
Average loss at step 12000: 2.638048
Average loss at step 14000: 2.519046
Average loss at step 16000: 2.441717
Average loss at step 18000: 2.324694
Average loss at step 20000: 2.235700
Nearest to 书记: 老道, 建设, 咬住, 特立独行, 握住, 市长, 提议, 顺利,
Nearest to 局长: 座位, 杀人, 跳脚, 放风, 一段, 第二天, 眼眶, 号子,
Nearest to 问: 喜出望外, 询问, 料到, 凄凉, 咱老, 反复无常, 你别, 打造,
Nearest to 季: 韩剧, 天花板, 愉快, 那双, 躬, 提供, 我敢, 检察院,
Nearest to 工作: 代价, 珍藏, 外柔内刚, 女婿, 激化矛盾, 讽刺, 打死, 送往,
Nearest to 陈清泉: 没能, 鹰, 威风凛凛, 别弄, 笑呵呵, 木板, 夺走, 作品,
Nearest to 省委: 三级, 敏锐, 代表, 一票, 谢谢您, 放鞭炮, 妇联, 研究,
Nearest to 太: 印记, 刻下, 原件, 抖动, 不放过, 推开, 事替, 托盘,
Average loss at step 22000: 2.161899
Average loss at step 24000: 2.134306
Average loss at step 26000: 2.040577
Average loss at step 28000: 1.986363
Average loss at step 30000: 1.935009
Nearest to 书记: 清, 选择, 建设, 大鬼, 跺脚, 打仗, 咬住, 满是,
Nearest to 局长: 跳脚, 放风, 一段, 杀人, 座位, 情感, 口中, 顶天立地,
Nearest to 问: 喜出望外, 询问, 料到, 半真半假, 干什么, 咱老, 海子, 反复无常,
Nearest to 季: 天花板, 韩剧, 那双, 检察院, 对视, 我敢, 赵东, 滚落,
Nearest to 工作: 代价, 珍藏, 外柔内刚, 女婿, 送往, 打死, 领导, 讽刺,
Nearest to 陈清泉: 鹰, 没能, 应承, 笑呵呵, 别弄, 持股会, 威风凛凛, 服从,
Nearest to 省委: 三级, 敏锐, 代表, 妇联, 举重若轻, 一票, 澳门, 行管,
Nearest to 太: 刻下, 抖动, 师生, 温和, 原件, 进屋, 海底, 印记,
Average loss at step 32000: 1.944882
Average loss at step 34000: 1.865572
Average loss at step 36000: 1.827617
Average loss at step 38000: 1.789619
Average loss at step 40000: 1.817634
Nearest to 书记: 清, 跺脚, 选择, 包庇, 顺利, 目光如炬, 握住, 老同事,
Nearest to 局长: 放风, 一段, 跳脚, 座位, 杀人, 顶天立地, 湿润, 情感,
Nearest to 问: 掏出, 喜出望外, 半真半假, 干什么, 海子, 撕, 度假村, 道,
Nearest to 季: 对视, 天花板, 那双, 韩剧, 检察院, 吃惊, 我敢, 室内,
Nearest to 工作: 代价, 领导, 外柔内刚, 打死, 珍藏, 记起, 田野, 女婿,
Nearest to 陈清泉: 笑呵呵, 没能, 应承, 鹰, 作品, 撞倒, 眯, 服从,
Nearest to 省委: 三级, 举重若轻, 敏锐, 人事, 代表, 行管, 谢谢您, 一票,
Nearest to 太: 刻下, 抖动, 温和, 原件, 问得, 进屋, 海底, 师生,
Average loss at step 42000: 1.743098
Average loss at step 44000: 1.720936
Average loss at step 46000: 1.694092
Average loss at step 48000: 1.734099
Average loss at step 50000: 1.663270
Nearest to 书记: 清, 选择, 打仗, 跺脚, 老同事, 老道, 顺利, 建设,
Nearest to 局长: 放风, 一段, 跳脚, 座位, 泡, 近乎, 滚圆, 排除,
Nearest to 问: 半真半假, 帮帮, 度假村, 聊生, 干什么, 顺势, 大为, 喜出望外,
Nearest to 季: 天花板, 对视, 检察院, 韩剧, 那双, 吃惊, 室内, 挽回,
Nearest to 工作: 领导, 打死, 新台阶, 外柔内刚, 张图, 陷阱, 田野, 规划图,
Nearest to 陈清泉: 没能, 应承, 作品, 鹰, 风云, 撞倒, 别弄, 请问,
Nearest to 省委: 举重若轻, 人事, 三级, 代表, 滑头, 谢谢您, 澳门, 敏锐,
Nearest to 太: 刻下, 抖动, 师生, 齐腰, 温和, 清白, 喊道, 问得,
Average loss at step 52000: 1.645000
Average loss at step 54000: 1.627952
Average loss at step 56000: 1.672146
Average loss at step 58000: 1.605531
Average loss at step 60000: 1.595123
Nearest to 书记: 清, 跺脚, 老同事, 握住, 没数, 目光如炬, 打仗, 选择,
Nearest to 局长: 放风, 跳脚, 一段, 排除, 座位, 泡, 干掉, 情感,
Nearest to 问: 半真半假, 度假村, 帮帮, 大为, 聊生, 干什么, 喜出望外, 顺势,
Nearest to 季: 对视, 检察院, 天花板, 那双, 吃惊, 韩剧, 室内, 同伟,
Nearest to 工作: 领导, 陷阱, 张图, 汇报会, 新台阶, 打死, 田野, 著名作家,
Nearest to 陈清泉: 应承, 没能, 撞倒, 作品, 服从, 请问, 鹰, 风云,
Nearest to 省委: 举重若轻, 人事, 通气, 灰烬, 滑头, 妇联, 敏锐, 三级,
Nearest to 太: 师生, 刻下, 喊道, 遗憾, 温和, 抖动, 清白, 齐腰,
Average loss at step 62000: 1.579666
Average loss at step 64000: 1.627861
Average loss at step 66000: 1.564315
Average loss at step 68000: 1.556485
Average loss at step 70000: 1.545672
Nearest to 书记: 清, 跺脚, 打仗, 没数, 握住, 老同事, 包庇, 选择,
Nearest to 局长: 放风, 一段, 跳脚, 泡, 排除, 近乎, 座位, 情感,
Nearest to 问: 半真半假, 度假村, 大为, 帮帮, 顺势, 追问, 聊生, 海子,
Nearest to 季: 对视, 检察院, 天花板, 同伟, 韩剧, 那双, 吃惊, 室内,
Nearest to 工作: 汇报会, 领导, 陷阱, 新台阶, 著名作家, 规划图, 张图, 送往,
Nearest to 陈清泉: 没能, 撞倒, 风云, 应承, 鹰, 赶巧, 人员, 看不到,
Nearest to 省委: 举重若轻, 人事, 灰烬, 妇联, 三级, 谢谢您, 通气, 讲讲,
Nearest to 太: 抖动, 海底, 遗憾, 喊道, 师生, 问得, 温和, 清白,
Average loss at step 72000: 1.596544
Average loss at step 74000: 1.530987
Average loss at step 76000: 1.524676
Average loss at step 78000: 1.514958
Average loss at step 80000: 1.569173
Nearest to 书记: 清, 没数, 跺脚, 目光如炬, 握住, 选择, 打仗, 包庇,
Nearest to 局长: 放风, 跳脚, 一段, 泡, 排除, 领教, 干掉, 情感,
Nearest to 问: 度假村, 半真半假, 聊生, 大为, 海子, 顺势, 道, 帮帮,
Nearest to 季: 对视, 检察院, 同伟, 那双, 吃惊, 韩剧, 天花板, 室内,
Nearest to 工作: 领导, 汇报会, 陷阱, 张图, 新台阶, 著名作家, 女婿, 规划图,
Nearest to 陈清泉: 没能, 风云, 撞倒, 应承, 人员, 鹰, 请问, 眯,
Nearest to 省委: 人事, 举重若轻, 通气, 灰烬, 内容, 冷不丁, 打电话, 谢谢您,
Nearest to 太: 问得, 遗憾, 喊道, 清白, 抖动, 海底, 自学, 温和,
Average loss at step 82000: 1.505065
Average loss at step 84000: 1.498249
Average loss at step 86000: 1.494963
Average loss at step 88000: 1.545789
Average loss at step 90000: 1.484452
Nearest to 书记: 没数, 跺脚, 清, 打仗, 满是, 老道, 若有所思, 超前,
Nearest to 局长: 放风, 跳脚, 一段, 泡, 排除, 滚圆, 近乎, 拦住,
Nearest to 问: 半真半假, 帮帮, 大为, 顺势, 度假村, 聊生, 嘱咐, 干什么,
Nearest to 季: 对视, 检察院, 同伟, 吃惊, 天花板, 韩剧, 那双, 公安厅,
Nearest to 工作: 领导, 陷阱, 汇报会, 起码, 规划图, 著名作家, 新台阶, 张图,
Nearest to 陈清泉: 没能, 风云, 撞倒, 应承, 眯, 请问, 鹰, 几任,
Nearest to 省委: 举重若轻, 人事, 灰烬, 谢谢您, 通气, 老肖, 打电话, 讲讲,
Nearest to 太: 抖动, 问得, 遗憾, 海底, 师生, 清白, 喊道, 刻下,
Average loss at step 92000: 1.480309
Average loss at step 94000: 1.481867
Average loss at step 96000: 1.527709
Average loss at step 98000: 1.466527
Average loss at step 100000: 1.464015
Nearest to 书记: 没数, 目光如炬, 握住, 清, 跺脚, 若有所思, 老同事, 打仗,
Nearest to 局长: 跳脚, 放风, 一段, 泡, 排除, 常理, 近乎, 干掉,
Nearest to 问: 半真半假, 帮帮, 顺势, 大为, 聊生, 度假村, 后悔, 谢谢,
Nearest to 季: 对视, 检察院, 同伟, 那双, 吃惊, 公安厅, 韩剧, 天花板,
Nearest to 工作: 领导, 汇报会, 陷阱, 张图, 著名作家, 起码, 规划图, 送往,
Nearest to 陈清泉: 没能, 应承, 请问, 鹰, 撞倒, 风云, 人员, 别弄,
Nearest to 省委: 人事, 举重若轻, 灰烬, 国富, 谢谢您, 打电话, 内容, 讲讲,
Nearest to 太: 遗憾, 海底, 问得, 师生, 喊道, 刻下, 抖动, 齐腰,
阶段性结论
通过上面的数据,我们很尴尬的发现效果不佳,样本数据实在是太少了,我们需要更多的中文语料。本文只是为了解释原理,不做过多的训练。
下面我们用 gensim 库来再走一遍我们训练过程
import gensim.models.word2vec as w2v
# Dimensionality of the resulting word vectors.
#more dimensions, more computationally expensive to train
#but also more accurate
#more dimensions = more generalized
num_features = 300
# Minimum word count threshold.
min_word_count = 3
# Number of threads to run in parallel.
#more workers, faster we train
num_workers = multiprocessing.cpu_count()
# Context window length.
context_size = 7
# Downsample setting for frequent words.
#0 - 1e-5 is good for this
downsampling = 1e-3
# Seed for the RNG, to make the results reproducible.
#random number generator
#deterministic, good for debugging
seed = 1
# By default (`sg=0`), CBOW is used.
# Otherwise (`sg=1`), skip-gram is employed.
# `sample` = threshold for configuring which higher-frequency words are randomly downsampled;
# default is 1e-3, useful range is (0, 1e-5).
rmmy2vec = w2v.Word2Vec(
sg=1, # 使用 skip-gram 算法
seed=seed, # 为了让每次结果一致,给予一个固定值
workers=num_workers,
size=num_features,
min_count=min_word_count,
window=context_size,
sample=downsampling
)
此处采用的作用,即参数 downsampling的作用,可以参看:How does sub-sampling of frequent words work in the context of Word2Vec?
def sentence_to_wordlist(raw):
clean = re.sub("[^\u4E00-\u9FD5]"," ", raw)
words = jieba.cut(clean)
return list(words)
sentences = []
with open("../input/人民的名义.txt",'r') as f:
for line in f:
line = line.strip("\n")
sents = list(SentenceSplitter.split(line))
for sent in sents:
wordlist = sentence_to_wordlist(sent)
wordlist = [c for c in wordlist if c !=' ']
if wordlist:
sentences.append(wordlist)
token_count = sum([len(sentence) for sentence in sentences])
print("The book corpus contains {0:,} tokens".format(token_count))
The book corpus contains 135,583 tokens
rmmy2vec.build_vocab(sentences)
print("Word2Vec vocabulary length:", len(rmmy2vec.vocab))
Word2Vec vocabulary length: 5539
rmmy2vec.train(sentences)
488248
if not os.path.exists("trained"):
os.makedirs("trained")
rmmy2vec.save(os.path.join("trained", "rmmy2vec.w2v"))
探索下训练出来的模型
rmmy2vec = w2v.Word2Vec.load(os.path.join("trained", "rmmy2vec.w2v"))
rmmy2vec.most_similar('书记')
[('李', 0.9455972909927368),
('省委', 0.9354562163352966),
('政法委', 0.9347406625747681),
('田', 0.919845461845398),
('国富', 0.9158734679222107),
('育良', 0.896552562713623),
('易', 0.8955333232879639),
('省委书记', 0.8948370218276978),
('秘书', 0.8864266872406006),
('汇报', 0.8858834505081177)]
#my video - how to visualize a dataset easily
tsne = sklearn.manifold.TSNE(n_components=2, random_state=0)
all_word_vectors_matrix = rmmy2vec.syn0
all_word_vectors_matrix_2d = tsne.fit_transform(all_word_vectors_matrix)
points = pd.DataFrame(
[
(word, coords[0], coords[1])
for word, coords in [
(word, all_word_vectors_matrix_2d[rmmy2vec.vocab[word].index])
for word in rmmy2vec.vocab
]
],
columns=["word", "x", "y"]
)
sns.set_context("poster")
points.plot.scatter("x", "y", s=10, figsize=(20, 12))
<matplotlib.axes._subplots.AxesSubplot at 0x1161e5e48>
## 解决中文乱码问题
import matplotlib
matplotlib.rcParams['font.family'] = 'STFangsong'#用来正常显示中文标签
# mpl.rcParams['font.sans-serif'] = [u'STKaiti']
matplotlib.rcParams['axes.unicode_minus'] = False #用来正常显示负号
def plot_region(x_bounds, y_bounds):
slice = points[
(x_bounds[0] <= points.x) &
(points.x <= x_bounds[1]) &
(y_bounds[0] <= points.y) &
(points.y <= y_bounds[1])
]
ax = slice.plot.scatter("x", "y", s=35, figsize=(10, 8))
for i, point in slice.iterrows():
ax.text(point.x + 0.005, point.y + 0.005, point.word, fontsize=11)
plot_region(x_bounds=(-6, -5), y_bounds=(10, 11))
展示线性关系
def nearest_similarity_cosmul(start1, end1, end2):
similarities = rmmy2vec.most_similar_cosmul(
positive=[end2, start1],
negative=[end1]
)
start2 = similarities[0][0]
print("{start1} is related to {end1}, as {start2} is related to {end2}".format(**locals()))
return start2
nearest_similarity_cosmul("书记", "官员", "省委")
书记 is related to 官员, as 国富 is related to 省委
'国富'
以上例子只为说明,真正训练的时候我看大多数用的都是wiki的中文语料,有两篇文档讲了这个过程,可以看: