Using a Chinese Analyzer with Lucene and Applying It in Java Code

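When Lucene's default analyzer processes Chinese text, it indexes every single character as a separate term, which makes later searches very inconvenient. IKAnalyzer's support for Chinese solves this problem well.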

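 How to use IKAnalyzer

 1) Add the IKAnalyzer jar to the project.

 2) Add the configuration file and the extension dictionaries to the project's classpath.

 Note: never edit an extension dictionary with Windows Notepad. The dictionary must be encoded as UTF-8 without a BOM, and Windows Notepad saves UTF-8 files with a BOM.

 Extension dictionary: adds new words.

 Stop-word dictionary: words that carry no meaning, or sensitive words.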
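Because a BOM at the start of a dictionary file would presumably end up glued to the first word, it can be worth checking a dictionary before putting it on the classpath. The following is a minimal sketch that uses only the JDK. The class name `BomCheck` and its methods are made up for illustration and are not part of IKAnalyzer.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class BomCheck {
    // The three bytes that make up the UTF-8 byte order mark.
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    /** Returns true if the given bytes start with the UTF-8 BOM. */
    public static boolean hasBom(byte[] data) {
        return data.length >= UTF8_BOM.length
                && data[0] == UTF8_BOM[0]
                && data[1] == UTF8_BOM[1]
                && data[2] == UTF8_BOM[2];
    }

    /** Returns the bytes without a leading BOM (unchanged if there is none). */
    public static byte[] stripBom(byte[] data) {
        return hasBom(data) ? Arrays.copyOfRange(data, UTF8_BOM.length, data.length) : data;
    }

    /** Rewrites the file in place without the BOM; returns true if the file was changed. */
    public static boolean stripBomFromFile(Path file) throws IOException {
        byte[] data = Files.readAllBytes(file);
        if (!hasBom(data)) {
            return false;
        }
        Files.write(file, stripBom(data));
        return true;
    }
}
```

Calling `BomCheck.stripBomFromFile(...)` on a dictionary file (the path is up to you) leaves a file without a BOM untouched and removes the three leading bytes otherwise.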

The following test prints the terms that IKAnalyzer produces for a sample passage:

package com.itheima;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.junit.Test;
import org.wltea.analyzer.lucene.IKAnalyzer;

/**
 * @ClassName luceneFour
 * @Description TODO
 * @Author gkz
 * @Date 2019/8/22 00:48
 * @Version 1.0
 **/
public class luceneFour {

    @Test
    public void testTokenStream() throws Exception{
        // 1) Create an Analyzer object; here an IKAnalyzer instead of a StandardAnalyzer
        Analyzer analyzer = new IKAnalyzer();
        // 2) Obtain a TokenStream from the analyzer's tokenStream method
        TokenStream tokenStream = analyzer.tokenStream("", "Lucene是apache软件基金会4 jakarta项目组的一个子项目，是一个开放源代码的全文检索引擎工具包，但它不是一个完整的全文检索引擎，而是一个全文检索引擎的架构，提供了完整的查询引擎和索引引擎，部分文本分析引擎（英文与德文两种西方语言）。Lucene的目的是为软件开发人员提供一个简单易用的工具包，以方便的在目标系统中实现全文检索的功能，或者是以此为基础建立起完整的全文检索引擎。Lucene是一套用于全文检索和搜寻的开源程式库，由Apache软件基金会支持和提供。Lucene提供了一个简单却强大的应用程式接口，能够做全文索引和搜寻。在Java开发环境里Lucene是一个成熟的免费开源工具。就其本身而言，Lucene是当前以及最近几年最受欢迎的免费Java信息检索程序库。人们经常提到信息检索程序库，虽然与搜索引擎有关，但不应该将信息检索程序库与搜索引擎相混淆。");
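        // 3) Register a CharTermAttribute on the TokenStream; it acts like a pointer to the current term
        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
        // 4) Call reset() on the TokenStream; skipping this call causes an exception
        tokenStream.reset();
        // 5) Iterate over the TokenStream with a while loop
        while (tokenStream.incrementToken()) {
            System.out.println(charTermAttribute.toString());
        }
        // Signal the end of the stream, as the TokenStream workflow requires before close()
        tokenStream.end();
        // 6) Close the TokenStream
        tokenStream.close();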
    }
}

After running the test, the console shows the following output:

D:\Java\jdk1.8.0_181\bin\java.exe -ea -Didea.test.cyclic.buffer.size=1048576 "-javaagent:D:\IntelliJ IDEA 2019.1.3\lib\idea_rt.jar=3928:D:\IntelliJ IDEA 2019.1.3\bin" -Dfile.encoding=UTF-8 -classpath "D:\IntelliJ IDEA 2019.1.3\lib\idea_rt.jar;D:\IntelliJ IDEA 2019.1.3\plugins\junit\lib\junit-rt.jar;D:\IntelliJ IDEA 2019.1.3\plugins\junit\lib\junit5-rt.jar;D:\Java\jdk1.8.0_181\jre\lib\charsets.jar;D:\Java\jdk1.8.0_181\jre\lib\deploy.jar;D:\Java\jdk1.8.0_181\jre\lib\ext\access-bridge-64.jar;D:\Java\jdk1.8.0_181\jre\lib\ext\cldrdata.jar;D:\Java\jdk1.8.0_181\jre\lib\ext\dnsns.jar;D:\Java\jdk1.8.0_181\jre\lib\ext\jaccess.jar;D:\Java\jdk1.8.0_181\jre\lib\ext\jfxrt.jar;D:\Java\jdk1.8.0_181\jre\lib\ext\localedata.jar;D:\Java\jdk1.8.0_181\jre\lib\ext\nashorn.jar;D:\Java\jdk1.8.0_181\jre\lib\ext\sunec.jar;D:\Java\jdk1.8.0_181\jre\lib\ext\sunjce_provider.jar;D:\Java\jdk1.8.0_181\jre\lib\ext\sunmscapi.jar;D:\Java\jdk1.8.0_181\jre\lib\ext\sunpkcs11.jar;D:\Java\jdk1.8.0_181\jre\lib\ext\zipfs.jar;D:\Java\jdk1.8.0_181\jre\lib\javaws.jar;D:\Java\jdk1.8.0_181\jre\lib\jce.jar;D:\Java\jdk1.8.0_181\jre\lib\jfr.jar;D:\Java\jdk1.8.0_181\jre\lib\jfxswt.jar;D:\Java\jdk1.8.0_181\jre\lib\jsse.jar;D:\Java\jdk1.8.0_181\jre\lib\management-agent.jar;D:\Java\jdk1.8.0_181\jre\lib\plugin.jar;D:\Java\jdk1.8.0_181\jre\lib\resources.jar;D:\Java\jdk1.8.0_181\jre\lib\rt.jar;D:\IdeaProject\luceneproject\target\classes;D:\repository\commons-io\commons-io\2.6\commons-io-2.6.jar;D:\repository\org\apache\lucene\lucene-core\7.7.2\lucene-core-7.7.2.jar;D:\repository\org\apache\lucene\lucene-analyzers-common\7.7.2\lucene-analyzers-common-7.7.2.jar;D:\repository\junit\junit\4.12\junit-4.12.jar;D:\repository\org\hamcrest\hamcrest-core\1.3\hamcrest-core-1.3.jar;D:\IdeaProject\luceneproject\lib\IK-Analyzer-1.0-SNAPSHOT.jar" com.intellij.rt.execution.junit.JUnitStarter -ideVersion5 -junit4 com.itheima.luceneFour,testTokenStream
lucene
是
apache
软件
基金会
基金
会
4
jakarta
项目
组
的
一个
一
个子
个
子项目
子项
项目
是
一个
一
个
开放源代码
开放
源代码
代码
的
全文
检索
索引
引擎
工具包
工具
包
但它
它不
不是
一个
一
个
完整
的
全文
检索
索引
引擎
而是
一个
一
个
全文
检索
索引
引擎
的
架构
提供
了
完整
的
查询
引擎
和
索引
引擎
部分
分文
文本
本分
分析
引擎
英文
与
德文
两种
两
种
西方
语言
lucene
的
目的
的
是
为
软件开发
软件
开发人员
开发
发人
人员
提供
一个
一
个
简单
易用
的
工具包
工具
包
以方
方便
的
在
目标
系统
中
实现
全文
检索
的
功能
或者是
或者
是以
以此为基础
以此
此为
基础
建立起
建立
立起
完整
的
全文
检索
索引
引擎
lucene
是
一套
一
套用
套
用于
全文
检索
和
搜寻
寻的
的
开源
程式库
程式
库
由
apache
软件
基金会
基金
会
支持
和
提供
lucene
提供
了
一个
一
个
简单
却
强大
的
应用程式
应用
程式
接口
能够
做
全文索引
全文
索引
和
搜寻
在
java
开发
环境
里
lucene
是
一个
一
个
成熟
的
免费
开源
工具
就其
本身
而言
lucene
是
当前
以及
最近
近几年
几年
最受
受欢迎
欢迎
的
免费
java
信息
检索
程序库
程序
库
人们
经常
提到
信息
检索
程序库
程序
库
虽然
与
搜索引擎
搜索
索引
引擎
有关
有
关
但不
不应该
不应
应该
将
信息
检索
程序库
程序
库
与
搜索引擎
搜索
索引
引擎
相混
混淆

Process finished with exit code 0
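The overlapping terms in this output (for example 基金会, 基金 and 会) are what one would expect from a fine-grained, dictionary-driven segmenter: every dictionary word found in the text is emitted, not only the longest one. The sketch below imitates that idea in plain Java. It is not IKAnalyzer's actual algorithm; the class `TinySegmenter`, its dictionary, and the single-character fallback rule are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TinySegmenter {
    private final Set<String> dictionary;
    private final int maxWordLength;

    public TinySegmenter(String... words) {
        this.dictionary = new HashSet<>(Arrays.asList(words));
        int max = 1;
        for (String word : words) {
            max = Math.max(max, word.length());
        }
        this.maxWordLength = max;
    }

    /** Emits every dictionary word found at each position, longest first. */
    public List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        int coveredUntil = 0; // end offset of the furthest word emitted so far
        for (int start = 0; start < text.length(); start++) {
            boolean matched = false;
            int longest = Math.min(maxWordLength, text.length() - start);
            for (int len = longest; len >= 1; len--) {
                String candidate = text.substring(start, start + len);
                if (dictionary.contains(candidate)) {
                    tokens.add(candidate);
                    coveredUntil = Math.max(coveredUntil, start + len);
                    matched = true;
                }
            }
            // A character that no dictionary word covers becomes a token of its own.
            if (!matched && start >= coveredUntil) {
                tokens.add(text.substring(start, start + 1));
            }
        }
        return tokens;
    }
}
```

With the dictionary 软件, 基金会, 基金, 会, calling `segment("软件基金会")` yields 软件, 基金会, 基金, 会, which is the same order as the corresponding terms in the console output.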

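Of course, any terms you want indexed as whole words can be added to hotword.dic, and words you do not want to appear can be added to the stop-word dictionary. The corresponding entries in IKAnalyzer.cfg.xml point to your hot-word and stop-word dictionaries; to list more than one file, append further names after the semicolon.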
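The semicolon-separated lists can be illustrated with the JDK alone, on the assumption that IKAnalyzer.cfg.xml uses the standard Java XML properties format. The sketch below parses such a document from a string and splits the entries. The keys `ext_dict` and `ext_stopwords` are the ones commonly seen in IKAnalyzer configuration files and should be checked against the file shipped with your jar; `extra.dic` and `stopword.dic` are hypothetical file names, and `IkConfigSketch` is not part of IKAnalyzer.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

public class IkConfigSketch {
    /** Returns the non-empty, semicolon-separated values stored under the given key. */
    public static List<String> readList(String xml, String key) {
        Properties props = new Properties();
        try {
            props.loadFromXML(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> result = new ArrayList<>();
        String value = props.getProperty(key);
        if (value == null) {
            return result;
        }
        for (String part : value.split(";")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed layout of the configuration file; file names are examples only.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n"
                + "<properties>\n"
                + "  <comment>IK Analyzer extension configuration</comment>\n"
                + "  <entry key=\"ext_dict\">hotword.dic;extra.dic;</entry>\n"
                + "  <entry key=\"ext_stopwords\">stopword.dic;</entry>\n"
                + "</properties>\n";
        System.out.println(readList(xml, "ext_dict"));
        System.out.println(readList(xml, "ext_stopwords"));
    }
}
```

A trailing semicolon is harmless in this sketch because empty entries are skipped.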

So how do we use IKAnalyzer for word segmentation in our own Java code?

The code is as follows:

package com.itheima;


import org.apache.commons.io.FileUtils;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.junit.Test;
import org.wltea.analyzer.lucene.IKAnalyzer;

import java.io.File;

/**
 * @ClassName luceneFirst
 * @Description TODO
 * @Author gkz
 * @Date 2019/8/21 18:02
 * @Version 1.0
 **/
public class luceneFirst {

    @Test
    public void createIndex() throws Exception{
        // 1. Create a Directory object that specifies where the index is stored.

        // Option A: keep the index in memory
        // Directory directory = new RAMDirectory();

        // Option B: keep the index on disk
        Directory directory = FSDirectory.open(new File("E:\\Desktop").toPath());
        // 2. Create an IndexWriter on top of the Directory, with IKAnalyzer as the analyzer
        IndexWriterConfig config = new IndexWriterConfig(new IKAnalyzer());
        IndexWriter indexWriter = new IndexWriter(directory, config);
        // 3. Read the files on disk and create one Document per file.
        File file = new File("E:\\Desktop\\87.lucene\\lucene\\02.参考资料\\searchsource");
        File[] files = file.listFiles();
        if (files == null) {
            // listFiles() returns null when the path is not a readable directory
            indexWriter.close();
            throw new IllegalStateException("Not a readable directory: " + file);
        }
        for (File file1 : files) {
            // File name
            String file1Name = file1.getName();
            // File path
            String path = file1.getPath();
            // File content
            String fileContext = FileUtils.readFileToString(file1, "utf-8");
            // File size in bytes
            long size = FileUtils.sizeOf(file1);
            // Create the fields.
            // Argument 1: field name, argument 2: field value, argument 3: whether to store the value
            Field fieldName = new TextField("name", file1Name, Field.Store.YES);
            Field fieldPath = new TextField("path", path, Field.Store.YES);
            Field fieldContext = new TextField("context", fileContext, Field.Store.YES);
            Field fieldSize = new TextField("size", size + "", Field.Store.YES);
            // Create the Document
            Document document = new Document();
            // 4. Add the fields to the Document
            document.add(fieldName);
            document.add(fieldPath);
            document.add(fieldContext);
            document.add(fieldSize);
            // 5. Write the Document to the index
            indexWriter.addDocument(document);
        }
        // 6. Close the IndexWriter
        indexWriter.close();
    }
}
