019.Elasticsearch搜索原理

1. 倒排索引原理初探

有以下两个文档:

# doc1
I really liked my small dogs, and I think my mom also liked them.
# doc2
He never liked any dogs, so I hope that my mom will not expect me to liked him

对文档进行分词,统计其在这两个doc中是否存在,*代表存在

word doc1 doc2
I * *
really *
liked * *
my * *
small *
dogs * *
and *
think *
mom * *
also *
them *
He *
never *
any *
so *
hope *
that *
will *
not *
expect *
me *
to *
him *

对单词进行常规化处理:

# 同义词转换
mom -> mother
small -> little
# 时态转换
liked -> like
# 类似:likes -> like
# 单复数转换
dogs -> dog
# 大小写转换
# 举例:Tom -> tom

重新建立倒排索引,加入常规化后的单词:

word 常规化单词 document id
I [1、2]
really [1]
liked like [1、2]
my [1、2]
small little [1]
dogs dog [1、2]
and [1]
think [1]
mom mother [1、2]
also [1]
them [1]
He [2]
never [2]
any [2]
so [2]
hope [2]
that [2]
will [2]
not [2]
expect [2]
me [2]
to [2]
him [2]

搜索"mother like little dog",首先分词,然后查看这些单词出现过的id,就返回了id为1和2的这两条文档

2. TF&IDF算法

ES使用TF&IDF算法(Term Frequency/Inverse Document Frequency)算法来计算document的评分,及其与搜索条件的匹配程度,此算法包括以下几个方面的影响因素:

  • Term Frequency:搜索词出现次数越多的文档越相关
doc1:hello you, and world is very good
doc2:hello, how are you

搜索请求:hello world
hello在doc1中出现1次,在doc2中出现1次
world在doc1中出现1次,在doc2中出现0次
所以doc1更相关
  • Inverse Document Frequency:搜索词在所有文档中出现的次数越多,则包含这个词的文档就越不相关
doc1:hello, today is very good
doc2:hi world, how are you
搜索请求:hello world
假设hello这个单词在全部文档中出现了10000次,world这个单词在全部文档中出现了100次
此时,doc1和doc2都是只能匹配一个单词,那么doc2的评分就比doc1靠前
  • Field-Length Norm:field的长度越长,相关度越弱
doc1:{ "title": "hello article", "content": "yes"(一万个yes)}
doc2:{ "title": "my article", "content": "world yes"(一万个yes)}
搜索请求:hello world
doc1和doc2同样都是匹配到一个单词,且都出现了1次,但是doc1匹配到单词的field(title)比doc2匹配到单词的field(content)长度短
所有doc1评分更高
  • 使用执行计划查看API可以获取score的具体计算公式
GET /test_index/test_type/_search?explain
{
  "query": {
    "match": {
      "test_field": "test hello"
    }
  }
}

3. 正排索引

搜索的时候,要依靠倒排索引,排序的时候,需要依靠正排索引,将每个document的每个field,然后进行排序,就是所谓的正排索引,在建立索引的时候,一方面会建立倒排索引,以供搜索用,一方面会建立正排索引,供排序,聚合,过滤等操作使用

doc1: {"name": "jack", "age": 27}
doc2: {"name": "tom", "age": 30}

以上两条doc的正排索引如下:

document id name age
1 jack 27
2 tom 30

在内存资源充足的时候,正排索引会缓存在内存中,否则就会写入磁盘

4. 倒排索引的结构

一条倒排索引的信息中包含以下内容:

  • 内容包含这个关键词的document list
  • 内容包含这个关键词的所有document的数量:IDF
  • 这个关键词在每个document中出现的次数:TF
  • 这个关键词在这个document中的位置
  • 内容包含这个关键词的每个document的长度:Length Norm
  • 内容包含这个关键词的所有document的平均长度

5. 字符串排序问题

如果对一个字符串进行排序,结果往往不准确,因为分词后是多个单词,再排序就不是我们想要的结果了,通常解决方案是,将一个字符串建立两次索引,一个分词,用来进行搜索,一个不分词,用来进行排序:

PUT /my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "title": {
          "type": "text",
          "fielddata": true,
          "fields": {
            "raw": {
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}

PUT /my_index/my_type/1
{
  "title": "first article"
}

PUT /my_index/my_type/2
{
  "title": "second article"
}

PUT /my_index/my_type/3
{
  "title": "third article"
}

# 使用分词的title进行排序
GET /my_index/my_type/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "title": {
        "order": "desc"
      }
    }
  ]
}

{
  "took" : 25,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : null,
    "hits" : [
      {
        "_index" : "my_index",
        "_type" : "my_type",
        "_id" : "3",
        "_score" : null,
        "_source" : {
          "title" : "third article"
        },
        "sort" : [
          "third"
        ]
      },
      {
        "_index" : "my_index",
        "_type" : "my_type",
        "_id" : "2",
        "_score" : null,
        "_source" : {
          "title" : "second article"
        },
        "sort" : [
          "second"
        ]
      },
      {
        "_index" : "my_index",
        "_type" : "my_type",
        "_id" : "1",
        "_score" : null,
        "_source" : {
          "title" : "first article"
        },
        "sort" : [
          "first"
        ]
      }
    ]
  }
}

# 使用不分词的title进行排序
GET /my_index/my_type/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "title.raw": {
        "order": "desc"
      }
    }
  ]
}

{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : null,
    "hits" : [
      {
        "_index" : "my_index",
        "_type" : "my_type",
        "_id" : "3",
        "_score" : null,
        "_source" : {
          "title" : "third article"
        },
        "sort" : [
          "third article"
        ]
      },
      {
        "_index" : "my_index",
        "_type" : "my_type",
        "_id" : "2",
        "_score" : null,
        "_source" : {
          "title" : "second article"
        },
        "sort" : [
          "second article"
        ]
      },
      {
        "_index" : "my_index",
        "_type" : "my_type",
        "_id" : "1",
        "_score" : null,
        "_source" : {
          "title" : "first article"
        },
        "sort" : [
          "first article"
        ]
      }
    ]
  }
}
©著作权归作者所有,转载或内容合作请联系作者
平台声明:文章内容(如有图片或视频亦包括在内)由作者上传并发布,文章内容仅代表作者本人观点,简书系信息发布平台,仅提供信息存储服务。