Lucene Scoring Formulas and Relevance in Practice (Part 2)

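Background:
Elasticsearch 5 and later are built on Lucene 6.2 or newer, so BM25 is the default scoring formula. Let's run a few experiments to see how the BM25 formula affects scores.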

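1. BM25 configuration experiment

1) Prepare the index
Create the mapping.
The test uses the IK analyzer; a whitespace analyzer works just as well.
Use a single shard so the results are consistent and easy to read: scoring happens inside Lucene, so with multiple shards the IDF and the average field length are not global, and the index statistics differ from shard to shard.
Define a default and a custom similarity, assigned to the text and title fields respectively.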

{
  "settings": {
    "number_of_shards": 1,
    "similarity": {
      "my_bm25": {
        "type": "BM25",
        "b": 0,
        "k1": 0
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "ik_smart",
          "similarity": "my_bm25"
        },
        "text": {
          "type": "text",
          "analyzer": "ik_smart",
          "similarity": "BM25"
        }
      }
    }
  }
}

Documents:
doc1: "text": "b c d e f g"
doc2: "text": "b c d"
doc3: "text": "b c d b c d"
doc4: "text": "h"
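To reproduce the experiment, the four documents can be loaded with the _bulk API. Below is a minimal sketch that only builds the request body. The index name `bm25_test` and the helper `build_bulk_body` are hypothetical, and copying the same content into `title` is my assumption (the explain output for `title` later in this post reports the same field lengths).

```python
import json

# The four test documents from the list above.
DOCS = {
    "1": "b c d e f g",
    "2": "b c d",
    "3": "b c d b c d",
    "4": "h",
}

def build_bulk_body(docs, index="bm25_test", doc_type="doc"):
    """Build an NDJSON body for the _bulk API: one action line plus one source line per document."""
    lines = []
    for doc_id, content in docs.items():
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        # Assumption: title and text hold the same content.
        source = {"title": content, "text": content}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"  # the _bulk body must end with a newline

body = build_bulk_body(DOCS)
print(body)
```

The `_type` value matches the `doc` mapping type used in the mapping above; on versions without mapping types it would have to be dropped.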
2) Query

{
  "explain": true,  // show scoring details
  "query": {
    "match": {
      "text": "c"
    }
  }
}

3) Scoring details
Results and scores:
doc3: 0.42996433
doc2: 0.3973088
doc1: 0.2961075
The IDF is the same for every document:

"value": 0.35667494,
"description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
"details": [
{
"value": 3,
"description": "docFreq", // number of documents that contain the term
"details": [ ]
}
,
{
"value": 4,
"description": "docCount", // total number of documents in the index
"details": [ ]
}
]
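The IDF value can be checked by hand from the formula in the description field. A minimal sketch (the function name `bm25_idf` is my own, not a Lucene API):

```python
import math

def bm25_idf(doc_freq, doc_count):
    """IDF as printed by explain: log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

# docFreq=3 and docCount=4 are the values from the explain output above.
idf = bm25_idf(doc_freq=3, doc_count=4)
print(round(idf, 6))
```

With 3 of 4 documents containing the term, the result agrees with the 0.35667494 reported by Elasticsearch up to single-precision rounding.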

tfNorm for doc3:
{
"value": 1.2054795,
"description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details": [
{
"value": 2,
"description": "termFreq=2.0",
"details": [ ]
}
,
{
"value": 1.2,
"description": "parameter k1",
"details": [ ]
}
,
{
"value": 0.75,
"description": "parameter b",
"details": [ ]
}
,
{
"value": 4,
"description": "avgFieldLength",
"details": [ ]
}
,
{
"value": 6,
"description": "fieldLength",
"details": [ ]
}
]
}

tfNorm for doc2:

{
"value": 1.113924,
"description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details": [
{
"value": 1,
"description": "termFreq=1.0",
"details": [ ]
}
,
{
"value": 1.2,
"description": "parameter k1",
"details": [ ]
}
,
{
"value": 0.75,
"description": "parameter b",
"details": [ ]
}
,
{
"value": 4,
"description": "avgFieldLength",
"details": [ ]
}
,
{
"value": 3,
"description": "fieldLength",
"details": [ ]
}
]
}

tfNorm for doc1:

{
"value": 0.8301887,
"description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details": [
{
"value": 1,
"description": "termFreq=1.0",
"details": [ ]
}
,
{
"value": 1.2,
"description": "parameter k1",
"details": [ ]
}
,
{
"value": 0.75,
"description": "parameter b",
"details": [ ]
}
,
{
"value": 4,
"description": "avgFieldLength",
"details": [ ]
}
,
{
"value": 6,
"description": "fieldLength",
"details": [ ]
}
]
}
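Putting the pieces together, the three scores can be reproduced outside Elasticsearch. This is a sketch under two assumptions: whitespace splitting stands in for the analyzer, and the final score is simply idf * tfNorm, as the explain output suggests. The helper names are my own. Lucene computes in single precision, so the last digits may differ slightly.

```python
import math

K1, B = 1.2, 0.75  # BM25 defaults, as shown in the explain output

def tf_norm(freq, field_len, avg_len, k1=K1, b=B):
    """tfNorm as printed by explain."""
    return (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * field_len / avg_len))

docs = {"doc1": "b c d e f g", "doc2": "b c d", "doc3": "b c d b c d", "doc4": "h"}
tokens = {name: text.split() for name, text in docs.items()}

# Average field length over all documents: (6 + 3 + 6 + 1) / 4
avg_len = sum(len(t) for t in tokens.values()) / len(tokens)

term = "c"
doc_freq = sum(1 for t in tokens.values() if term in t)
idf = math.log(1 + (len(tokens) - doc_freq + 0.5) / (doc_freq + 0.5))

scores = {}
for name, toks in tokens.items():
    freq = toks.count(term)
    if freq:  # documents without the term are not matched
        scores[name] = idf * tf_norm(freq, len(toks), avg_len)

# Print in descending score order, like the search result.
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(name, round(score, 6))
```

doc3 ranks first because the term occurs twice, and doc2 beats doc1 because its field is shorter than average while doc1's is longer.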

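Now query with b and k1 customized. The title field uses the custom similarity, where b=0 turns off the length effect and k1=0 turns off the term-frequency effect. So how are documents ordered when their scores are equal? Ties are sorted by doc ID: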
{
  "explain": true,
  "query": {
    "match": {
      "title": "c"
    }
  }
}
The IDF is unchanged from before.
Result order:
doc1: 0.35667494
doc2: 0.35667494
doc3: 0.35667494
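tfNorm is now 1 for every document. This change is rather extreme and serves only as an experiment; when tuning for real, adjust b and k1 to your own needs.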

{
"value": 1,
"description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details": [
{
"value": 1,
"description": "termFreq=1.0",
"details": [ ]
}
,
{
"value": 0,
"description": "parameter k1",
"details": [ ]
}
,
{
"value": 0,
"description": "parameter b",
"details": [ ]
}
,
{
"value": 4,
"description": "avgFieldLength",
"details": [ ]
}
,
{
"value": 6,
"description": "fieldLength",
"details": [ ]
}
]
}
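To get a feel for what each parameter does before changing it in a real index, the tfNorm formula can be evaluated for a few settings. A small sketch using the term frequencies and field lengths from the explain output; the names `DOC_STATS` and `tf_norms` are hypothetical helpers of mine:

```python
def tf_norm(freq, field_len, avg_len, k1, b):
    """tfNorm as printed by explain."""
    return (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * field_len / avg_len))

AVG_LEN = 4.0
# (termFreq of "c", fieldLength) per document, taken from the explain output
DOC_STATS = {"doc1": (1, 6), "doc2": (1, 3), "doc3": (2, 6)}

def tf_norms(k1, b):
    """tfNorm of every matching document for one (k1, b) setting."""
    return {
        name: round(tf_norm(freq, length, AVG_LEN, k1, b), 4)
        for name, (freq, length) in DOC_STATS.items()
    }

print(tf_norms(k1=0.0, b=0.0))    # both effects off: every value collapses to 1.0
print(tf_norms(k1=1.2, b=0.0))    # length ignored, term frequency still counts
print(tf_norms(k1=1.2, b=0.75))   # the defaults
```

With k1=0 the formula reduces to freq / freq, which is why every document ends up with a score equal to the IDF.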

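2. The discount_overlaps setting

Official description:
discount_overlaps: determines whether overlap tokens (tokens with a position increment of 0) are ignored. Defaults to true, meaning overlap tokens are not counted.
In my tests, however, it appears to have no effect. Why?
1) Index creation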
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "index": {
      "similarity": {
        "my_bm25": {
          "type": "BM25",
          "discount_overlaps": false
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "ik_max_word_test",
          "similarity": "my_bm25"
        }
      }
    }
  }
}
2) The ik_max_word_test analyzer

[Image not available]

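3) Query and inspect the scores
Searching for term=北京 clearly produces overlapping terms, yet the explain output still shows freq=2.0 whether discount_overlaps is true or false. So the setting looks ineffective to me; if anyone who understands it can explain, please do.

[Image not available]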
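One possible explanation worth checking: the Elasticsearch documentation describes discount_overlaps as applying when the norm is computed, which would make it affect fieldLength rather than freq. The sketch below is a simplified model of that reading, not Lucene's actual code, and the token stream is hypothetical (I am not reproducing the real ik_max_word_test output):

```python
from collections import Counter

# Hypothetical analyzer output as (token, position_increment) pairs.
# An increment of 0 means the token sits on the same position as the previous one.
TOKENS = [("北京", 1), ("北京", 0), ("欢迎", 1), ("你", 1)]

def field_length(tokens, discount_overlaps):
    """Number of tokens counted toward the length norm."""
    total = len(tokens)
    overlaps = sum(1 for _, increment in tokens if increment == 0)
    return total - overlaps if discount_overlaps else total

def term_freq(tokens, term):
    """Term frequency counts every occurrence; overlaps are not discounted here."""
    return Counter(token for token, _ in tokens)[term]

print(term_freq(TOKENS, "北京"))                      # same in both cases
print(field_length(TOKENS, discount_overlaps=True))   # overlap token excluded
print(field_length(TOKENS, discount_overlaps=False))  # overlap token included
```

If this model is right, the place to look for a difference is the fieldLength entry of the explain output, after reindexing with the changed setting, since norms are written at index time.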

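3. Other scoring models
1) DFR (divergence from randomness)
A DFR model is obtained by instantiating the three components of the framework: a basic randomness model, a first normalization, and a term frequency normalization.
The basic idea: "the more the term frequency within the matching document diverges from its frequency in the rest of the collection, the more information the term carries in document d." In other words, the term weight is inversely related to the probability of the term frequency in document d as given by a randomness model M.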

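Basic randomness model
M denotes the type of randomness model used to compute the probability. To choose a suitable randomness model M, we can use different urn models. IR is thus viewed as a probabilistic process that uses random draws from urn models or, equivalently, random placement of colored balls into urns, with documents in place of urns and terms in place of colors.
[Image not available]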

The available basic models are listed below:

[Image not available]

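First normalization
When a rare term does not occur in a document, the probability that it is informative for that document is almost zero. Conversely, if a rare term occurs many times in a document, it is very likely (almost certain) to be informative about the topic the document describes. The DFR models therefore include a risk component: if the term frequency in the document is high, the risk that the term is not informative is minimal.
Two models are used to compute the information gain of a term within a document: the Laplace model L and the two-Bernoulli model B.

[Image not available]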

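Term frequency normalization
The document length dl is normalized to a standard length sl. The term frequency tf is then recomputed with respect to the standard document length, that is:
[Image not available]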
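The formula image is missing from this copy, so I cannot say which variant it showed. One widely used form of this step, usually called normalization 2 (H2), is tfn = tf * log2(1 + c * sl / dl). The sketch below assumes that form; the function name `tfn_h2` and the sample lengths are illustrative only.

```python
import math

def tfn_h2(tf, doc_len, std_len, c=1.0):
    """Normalization 2 (assumed form): tfn = tf * log2(1 + c * sl / dl)."""
    return tf * math.log2(1 + c * std_len / doc_len)

# A document of exactly the standard length keeps its tf when c = 1.
print(tfn_h2(tf=2, doc_len=4, std_len=4))
# A longer document is discounted, a shorter one is boosted.
print(tfn_h2(tf=2, doc_len=8, std_len=4))
print(tfn_h2(tf=2, doc_len=2, std_len=4))
```

The parameter c plays the role of the normalization.h2.c setting: larger values weaken the penalty for long documents.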

DFR similarity parameters:
·basic_model: one of be, d, g, if, in, or ine.
·after_effect: one of no, b, or l.
·normalization: one of no, h1, h2, h3, or z.
If normalization is anything other than no, a normalization factor must also be set, and which property holds it depends on the chosen normalization: normalization.h1.c for h1, normalization.h2.c for h2, normalization.h3.c for h3, and normalization.z.z for z. All of these values are floating-point numbers.
Example:

[Image not available]
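A configuration along these lines would fit the parameters described above. It is a sketch written as a Python dict so that it can be dumped to JSON and sent as the index-creation body; the similarity name `my_dfr`, the chosen values, and the reuse of the `title` field are illustrative assumptions of mine.

```python
import json

# Index settings with a custom DFR similarity.
# "my_dfr" is a hypothetical name; parameter names and values come from the list above.
settings = {
    "settings": {
        "number_of_shards": 1,
        "similarity": {
            "my_dfr": {
                "type": "DFR",
                "basic_model": "g",
                "after_effect": "l",
                "normalization": "h2",
                # h2 is selected, so the factor goes into normalization.h2.c
                "normalization.h2.c": "3.0",
            }
        },
    },
    "mappings": {
        "doc": {
            "properties": {
                "title": {"type": "text", "similarity": "my_dfr"}
            }
        }
    },
}

print(json.dumps(settings, indent=2))
```

Which values are accepted depends on the Elasticsearch version, so check the list against the documentation for the version you run.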

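2) DFI (divergence from independence)
Implements the divergence from independence (DFI) model based on the chi-squared statistic (that is, the standardized chi-squared distance from independence in the term frequency tf).
Related link: https://trec.nist.gov/pubs/trec21/papers/irra.web.nb.pdf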

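3) IB (information based)
The algorithm is based on the idea that the information content of any symbolic distribution sequence is determined primarily by the repetitive usage of its basic elements. For written text, this corresponds to comparing the writing styles of different authors.
·distribution: one of ll or spl.
·lambda: one of df or ttf.
·normalization: same as for the DFR similarity.
Configuration example: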

"similarity" : {
    "esserverbook_ib_similarity" : {
        "type" : "IB",
        "distribution" : "ll",
        "lambda" : "df",
        "normalization" : "z",
        "normalization.z.z" : "0.25"
    }
}

If you enjoy search technology, follow my WeChat official account 'Lucene Elasticsearch 工作积累'.
I post search-related content there every day.
