1 · Download

Tencent word embedding download link: https://ai.tencent.com/ailab/nlp/data/Tencent_AILab_ChineseEmbedding.tar.gz
Tencent word embedding home page: https://ai.tencent.com/ailab/nlp/embedding.html

2 · Data

The download link above gives you an archive, Tencent_AILab_ChineseEmbedding.tar.gz.
Extracting it yields two files: README.txt and Tencent_AILab_ChineseEmbedding.txt.

The one we need is Tencent_AILab_ChineseEmbedding.txt, which is 15.5 GB.
Its contents look like this:

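[Image: the first three lines of Tencent_AILab_ChineseEmbedding.txt]

The first line gives the total number of words and the embedding size. From the second line on, each line holds one word followed by its vector, separated by spaces. README.txt explains this as well:

[Image: README.txt]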
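To check this layout without loading the whole 15.5 GB file, it is enough to read the header and the first few rows. The following is only a minimal sketch: the function name read_header_and_rows and the tiny 3-dimensional sample are made up for illustration, and the real file has 200-dimensional vectors.

```python
import io

def read_header_and_rows(stream, max_rows=2):
    """Parse the word2vec text header, then up to max_rows (word, vector) rows."""
    # Header line: "<vocab size> <embedding size>"
    vocab_size, dim = (int(x) for x in stream.readline().split())
    rows = []
    for _ in range(max_rows):
        line = stream.readline()
        if not line:
            break
        # Split from the right so that the last `dim` fields are the vector
        parts = line.rstrip().rsplit(" ", dim)
        rows.append((parts[0], [float(x) for x in parts[1:]]))
    return vocab_size, dim, rows

# Made-up sample in the same layout (3 dimensions instead of 200)
sample = io.StringIO("2 3\n今天 0.1 0.2 0.3\n天气 0.4 0.5 0.6\n")
vocab_size, dim, rows = read_header_and_rows(sample)
print(vocab_size, dim, [word for word, _ in rows])
```

For the real file, pass open(path, encoding="utf-8") instead of the StringIO sample; only the lines actually requested are read.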

3 · Usage

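The Tencent embedding home page also explains how to use the file; mainly I wanted to record how long loading it takes. I used gensim 3.8.1; the import may differ in older gensim versions.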

from gensim.models import KeyedVectors
from time import time

tic = time()
file = ''  # path to the extracted Tencent_AILab_ChineseEmbedding.txt
wv_from_text = KeyedVectors.load_word2vec_format(file, binary=False)
toc = time()
print(toc - tic)

running...
It finally finished: 1531 s, roughly 25 minutes... Clearly it is better to extract the words I need into a separate file.

------------------------------- Update 2020.03.15 -------------------------------
Extracting a custom word embedding from the Tencent vectors and saving it:

import os
from time import time

from gensim.models import KeyedVectors

save_folder = ''  # folder in which to save the custom embedding
my_word_list = ["今天", "天气", "很好"]  # words to extract
tencent_embed_file = ''  # path to the extracted Tencent_AILab_ChineseEmbedding.txt

tic = time()  # time.clock() was removed in Python 3.8, so use time() here
wv_from_text = KeyedVectors.load_word2vec_format(tencent_embed_file, binary=False)
toc = time()
print('read tencent embedding cost {:.2f}s'.format(toc - tic))

# Look up each word's vector; a word missing from the Tencent vocabulary raises KeyError
my_vector_list = []
for word in my_word_list:
    my_vector_list.append(wv_from_text[word])
print('my vocab size:', len(my_word_list), len(my_vector_list))

my_wv = KeyedVectors(200)  # the Tencent vectors are 200-dimensional
my_wv.add(my_word_list, my_vector_list)  # gensim 3.8.x API

save_file = os.path.join(save_folder, 'my-word-embedding.txt')
print("my vocab generated, and saving in {}".format(save_file))
my_wv.save_word2vec_format(save_file)

print('done.')

No more spending half an hour loading Tencent's 8 million word vectors!
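Since the file is plain text with one word per line, the same subset could probably also be pulled out by streaming through the file once, without building the full gensim model first. This is an alternative sketch, not the approach used above, and I have not timed it on the real file. The function name extract_subset and the 3-dimensional sample data are made up for illustration.

```python
import io

def extract_subset(src, dst, wanted):
    """Stream a word2vec text file from src and write only the wanted words to dst."""
    wanted = set(wanted)
    # Header line: "<vocab size> <embedding size>"
    _, dim = (int(x) for x in src.readline().split())
    kept = []
    for line in src:
        # The last `dim` fields are the vector; what remains is the word
        word = line.rstrip().rsplit(" ", dim)[0]
        if word in wanted:
            kept.append(line.rstrip() + "\n")
    # The output header must carry the new word count, so it is written last-known-first
    dst.write("{} {}\n".format(len(kept), dim))
    dst.writelines(kept)
    return len(kept)

# Made-up sample in the same layout (3 dimensions instead of 200)
src = io.StringIO("3 3\n今天 0.1 0.2 0.3\n天气 0.4 0.5 0.6\n很好 0.7 0.8 0.9\n")
dst = io.StringIO()
count = extract_subset(src, dst, ["今天", "很好", "不存在"])
print(count)
print(dst.getvalue())
```

Words that do not appear in the source are silently skipped here, so comparing the returned count with the length of the word list shows whether anything was missing. With real files, src and dst would be open(..., encoding="utf-8") handles, and the output should be loadable with KeyedVectors.load_word2vec_format just like the file saved above.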

[Notes] Using the Tencent word embeddings, their load time (about 25 min), and extracting a custom embedding subset