In the previous lesson you should already have trained a basic DL4J model. In this lesson we look at handling text vectors by having the AI read Water Margin (《水浒传》); please download 水浒传.txt yourself (a Baidu search will turn it up).
Word Segmentation
First, segment the text of Water Margin. For segmentation I use HanLP:
<dependency>
    <groupId>com.hankcs</groupId>
    <artifactId>hanlp</artifactId>
    <version>portable-1.7.3</version>
</dependency>
The code is as follows:
import com.hankcs.hanlp.HanLP;
import com.hankcs.hanlp.seg.common.Term;

import java.io.*;
import java.nio.charset.StandardCharsets;

public static void data(File source, File save) throws IOException {
    if (!save.exists()) {
        save.createNewFile();
    }
    long startTime = System.currentTimeMillis();
    try (BufferedReader reader = new BufferedReader(
                 new InputStreamReader(new FileInputStream(source), StandardCharsets.UTF_8));
         BufferedWriter writer = new BufferedWriter(
                 new OutputStreamWriter(new FileOutputStream(save), StandardCharsets.UTF_8))) {
        String line;
        while ((line = reader.readLine()) != null) {
            // Segment the line with HanLP and join the tokens with single spaces
            StringBuilder stringBuilder = new StringBuilder();
            for (Term term : HanLP.segment(line)) {
                if (stringBuilder.length() > 0) {
                    stringBuilder.append(" ");
                }
                stringBuilder.append(term.word.trim());
            }
            writer.write(stringBuilder.toString() + "\n");
        }
    }
    // Print elapsed time in milliseconds
    System.out.println(System.currentTimeMillis() - startTime);
}
Note: this loads the novel, segments it line by line with HanLP, joins the tokens of each line with spaces, and saves the result to disk.
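For reference, here is a minimal sketch of how you might invoke this method. The file names are assumptions; use whatever paths your downloaded copy actually lives at:

// Hypothetical invocation: segment the raw novel into a space-delimited corpus.
// Both file names below are placeholders -- adjust them to your own paths.
File source = new File("水浒传.txt");       // raw novel, UTF-8
File save = new File("水浒传_分词.txt");     // segmented output, one line per source line
data(source, save);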
Word2Vec Vector Processing
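The Word2Vec, BasicLineIterator, and tokenizer classes used below come from Deeplearning4j's NLP module. If your project from the earlier lessons doesn't already pull it in, you'll need a dependency along these lines (the version shown is an assumption; keep it in line with the DL4J version you've been using):

<dependency>
    <groupId>org.deeplearning4j</groupId>
    <artifactId>deeplearning4j-nlp</artifactId>
    <!-- version is an assumption; match your other DL4J artifacts -->
    <version>1.0.0-beta3</version>
</dependency>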
import org.deeplearning4j.models.word2vec.Word2Vec;
import org.deeplearning4j.text.sentenceiterator.BasicLineIterator;
import org.deeplearning4j.text.sentenceiterator.SentenceIterator;
import org.deeplearning4j.text.tokenization.tokenizer.preprocessor.CommonPreprocessor;
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;

import java.io.File;
import java.io.FileNotFoundException;
import java.util.Collection;

public void train(String filePath) throws FileNotFoundException {
    // Iterate over the segmented corpus one line (sentence) at a time
    SentenceIterator iter = new BasicLineIterator(new File(filePath));
    // Split each line on whitespace and normalize the tokens
    TokenizerFactory t = new DefaultTokenizerFactory();
    t.setTokenPreProcessor(new CommonPreprocessor());

    log.info("Building model....");
    Word2Vec vec = new Word2Vec.Builder()
            .minWordFrequency(5)  // words appearing fewer than 5 times are excluded
            .iterations(1)
            .layerSize(200)       // dimension of each word vector
            .seed(42)
            .windowSize(5)        // context window size
            .iterate(iter)
            .tokenizerFactory(t)
            .build();
    vec.fit();

    String[] names = {"大哥", "林冲", "宋江", "武松", "及时雨", "招安", "梁山"};
    log.info("Closest Words:");
    for (String name : names) {
        System.out.println(name + ">>>>>>");
        Collection<String> lst = vec.wordsNearest(name, 10);
        System.out.println(lst);
    }
}
Note: pass in the path of the segmented file, hand the data to Word2Vec for training, then call wordsNearest to print the 10 words nearest each keyword.
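Training on the whole novel takes a while, so in practice you'll want to persist the model rather than retrain it every run. A minimal sketch using DL4J's WordVectorSerializer (the output file name is an assumption):

import org.deeplearning4j.models.embeddings.loader.WordVectorSerializer;
import org.deeplearning4j.models.word2vec.Word2Vec;

// Save the trained vectors to disk (file name is a placeholder)
WordVectorSerializer.writeWord2VecModel(vec, "shuihu_word2vec.txt");

// Later, reload the model without retraining
Word2Vec restored = WordVectorSerializer.readWord2VecModel("shuihu_word2vec.txt");
// Cosine similarity between two words from the corpus
System.out.println(restored.similarity("宋江", "及时雨"));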
Output
(Screenshot: the 10 nearest words printed for each keyword.)
Reminder
An AI model's accuracy depends in large part on how well the training set is prepared; in real-world projects, most of your time will go into data cleaning. I encourage you to build a deep understanding of Word2Vec.
The next lesson covers image classification.
I take on commercial AI model training work of all kinds. If your company wants to use AI to solve a current business problem, feel free to contact me. WeChat: CompanyAiHelper