Using BERT

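1. Predicting words
This is easy to implement. It works fairly well for fixed, formulaic expressions and for words in the middle of a sentence.
It works poorly on sentences from other domains and on words at the end of a sentence.
The implementation is as follows: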

import codecs
import os

import numpy as np
from keras_bert import load_trained_model_from_checkpoint, Tokenizer

# Paths to the pretrained Chinese BERT checkpoint
model_dir = '/home/h/models/chinese_L-12_H-768_A-12'
config_path = os.path.join(model_dir, 'bert_config.json')
checkpoint_path = os.path.join(model_dir, 'bert_model.ckpt')
dict_path = os.path.join(model_dir, 'vocab.txt')

# training=True keeps the masked-LM and next-sentence heads on the model
model = load_trained_model_from_checkpoint(config_path, checkpoint_path, training=True)

token_dict = {}
with codecs.open(dict_path, 'r', 'utf8') as reader:
    for line in reader:
        token = line.strip()
        token_dict[token] = len(token_dict)
token_dict_inv = {v: k for k, v in token_dict.items()}

tokenizer = Tokenizer(token_dict)

text = '怎么添加农历初一、十五的提醒'
test_str = '提醒'  # the substring to mask and predict
pos = text.index(test_str)
# +1 because the tokenizer puts [CLS] at position 0; this assumes one
# token per character, which holds for plain Chinese text
index = [pos + 1 + i for i in range(len(test_str))]
print('index:', index)

tokens = tokenizer.tokenize(text)
for s in index:
    tokens[s] = '[MASK]'

seq_len = 512  # input length the model was loaded with
pad = seq_len - len(tokens)
indices = np.array([[token_dict[token] for token in tokens] + [0] * pad])
segments = np.array([[0] * seq_len])
# 1 at every masked position, 0 elsewhere (no longer hard-coded to two tokens)
masks = np.array([[1 if i in index else 0 for i in range(seq_len)]])

predicts = model.predict([indices, segments, masks])[0].argmax(axis=-1).tolist()
print('Fill with: ', [token_dict_inv[x] for x in predicts[0][index[0]:index[0] + len(test_str)]])

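In the second part of the script, change the first two lines (`text` and `test_str`) to try different predictions and see the corresponding results.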
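To make that kind of experimenting less error-prone, the masking steps can be pulled into one function. The following is only a sketch: `mask_target` is a hypothetical helper name, and it splits the text into single characters as a stand-in for what the tokenizer produces on plain Chinese text, so it does not need the model or the vocabulary file.

```python
def mask_target(text, target, seq_len=512):
    """Return (tokens, positions, mask_vector) for a character-level input.

    Assumption: one token per character, standing in for the real tokenizer.
    """
    start = text.index(target)  # raises ValueError if target is not in text
    tokens = ['[CLS]'] + list(text) + ['[SEP]']
    # Shift by 1 to skip the leading [CLS]
    positions = [start + 1 + i for i in range(len(target))]
    for p in positions:
        tokens[p] = '[MASK]'
    # 1 at masked positions, 0 everywhere else, padded out to seq_len
    mask_vector = [1 if i in positions else 0 for i in range(seq_len)]
    return tokens, positions, mask_vector


tokens, positions, mask_vector = mask_target('怎么添加农历初一、十五的提醒', '提醒')
print(positions)
print(tokens)
```

The returned `positions` play the role of `index` in the script above, and `mask_vector` corresponds to one row of `masks`.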

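2. Predicting the next sentence
Word prediction is not especially good to begin with, and sentence prediction is even less reliable.
I don't quite understand why the example code always supplies the sentence to be predicted up front. If that information is passed in, what is there left to predict?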
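A likely explanation is that BERT's next-sentence task does not generate anything: it is a two-way classification that takes a pair of sentences and scores whether the second one plausibly follows the first. That would be why the candidate sentence has to be passed in. The sketch below only shows how such a pair is laid out as input; `build_pair_inputs` is a hypothetical helper, the two sentences are made-up examples, and character-level splitting again stands in for the real tokenizer.

```python
def build_pair_inputs(sentence_a, sentence_b):
    """Lay out a sentence pair as [CLS] A [SEP] B [SEP] with segment ids."""
    tokens_a = ['[CLS]'] + list(sentence_a) + ['[SEP]']
    tokens_b = list(sentence_b) + ['[SEP]']
    tokens = tokens_a + tokens_b
    # Segment 0 marks the first sentence, segment 1 marks the candidate
    segments = [0] * len(tokens_a) + [1] * len(tokens_b)
    return tokens, segments


tokens, segments = build_pair_inputs('今天下雨', '记得带伞')
print(tokens)
print(segments)
```

With a model loaded using `training=True`, the second output of `model.predict` is the score for this pair, so "prediction" here means ranking candidates you already have rather than writing a new sentence.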

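3. Generating sentence vectors for downstream tasks
I used BERT-generated sentence vectors to judge sentence similarity, scoring pairs with cosine similarity. A sentence with every word replaced by a synonym scored 96%, but unrelated sentences also scored 90% and 91%. I'm not sure whether my way of measuring is the problem, or whether I should simply pick the highest-scoring candidate.
Next I'll try other ways of measuring similarity.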
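One thing that may be worth trying: if all the sentence vectors share a large common component, raw cosine similarity will be high for every pair, and subtracting the mean vector of the collection before comparing can spread the scores out. The sketch below illustrates the effect on toy vectors that I made up for the purpose; they are not BERT outputs, and whether centering helps on real sentence vectors would need to be tested.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def center(vectors):
    """Subtract the per-dimension mean of the collection from every vector."""
    n = len(vectors)
    mean = [sum(col) / n for col in zip(*vectors)]
    return [[x - m for x, m in zip(vec, mean)] for vec in vectors]


# Toy data: a large shared offset plus a small distinguishing part
vectors = [
    [10.0, 10.0, 1.0, 0.0],   # sentence A
    [10.0, 10.0, 0.9, 0.1],   # stands in for a paraphrase of A
    [10.0, 10.0, -1.0, 0.5],  # stands in for an unrelated sentence
]

raw_ab = cosine(vectors[0], vectors[1])
raw_ac = cosine(vectors[0], vectors[2])
centered = center(vectors)
cen_ab = cosine(centered[0], centered[1])
cen_ac = cosine(centered[0], centered[2])
print(raw_ab, raw_ac)
print(cen_ab, cen_ac)
```

On this toy data both raw scores are close to 1, while after centering the paraphrase pair stays high and the unrelated pair turns negative. In practice the mean should be computed over a large set of sentences rather than three.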
