Elasticsearch Data Modeling (Part 1)

Reposted from: https://www.jianshu.com/p/098236cf3a44
https://blog.csdn.net/napoay/article/details/62233031

Example 1: E-commerce promotion data structure

{
  "id": 536600477,
  "name": "黑色外穿打底裤女春秋薄款铅笔裤2019新款高腰九分显瘦紧身小脚裤",
  "image": "http://img.alicdn.com/bao/uploaded/i4/1687728515/O1CN015vKRk22Clv2z9jVKM_!!0-item_pic.jpg",
  "item_url":  "http://item.taobao.com/item.htm?id=536600477798",
  "shop_name": "XXX旗舰店",
  "price": 35.00,
  "sales": 12866,
  "contact_info": "XXX旗舰店",
  "short_url": "https://s.click.taobao.com/6dhjX0w",
  "sales_url":  "https://s.click.taobao.com/t?e=m%3D2%26s%3DhqNnFErxaS0cQipKwQzePOeEDrYVVa64K7Vc7tFgwiG3bLqV5UHdqSJ215tW5ra7%2Fl0%2B1yuzCtL9CVjm9%2FaTIMEcIrQjme5phH%2FwEhdaGdpwfW9VvJkbiUOLibAxXu8J4DrzI0Q%2Bh5mWydDa%2BK5%2FZ44CXhN9RDLu87eUjW4Ylwlp3E7b2H5imSCyCj9paIOIxiXvDf8DaRs%3D",
  "sales_pass":  "￥q6vvYNlY15Y￥",
  "coupon_total_num": 50000,
  "coupon_remaining_num":  49981,
  "coupon_quota": "满35减10",
  "coupon_start_date": "2019-09-20",
  "coupon_end_date": "2019-09-25",
  "coupon_url": "https://uland.taobao.com/coupon/edetail?e=EpEKjA4ejsRt3vqbdXnGlgxMgopp14njlHycenxkSuDwJfMHI%2FfVmw2KFrzHTGtgHv69%2F64THFCtOwU1ltpiC5ZrJ2LltVbgH31ZeQAUzbQ%3D&af=1&pid=mm_226490165_153450382_44990650090",
  "coupon_pass": "￥b0NmYNlbC8t￥",
  "coupon_short_url": "https://s.click.taobao.com/XRkjX0w"
}

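"id" is an integer, so it is mapped as "long".
"name" is a string that must be searchable, so it has to be analyzed. Map it as "text", index it with the Chinese analyzer "ik_max_word", and search it with the "ik_smart" analyzer.
Notes:
1. "type": "text" is analyzed (split into tokens); "type": "keyword" is not.
2. "ik_max_word" produces the finest-grained tokens and "ik_smart" the coarsest.
At index time, "ik_max_word" is typically used so that a document can be matched by as many terms as possible.
At search time, "ik_smart" is used so that the query stays precise.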

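ik_max_word: splits text at the finest granularity. For example, “中华人民共和国国歌” is split into “中华人民共和国,中华人民,中华,华人,人民共和国,人民,人,民,共和国,共和,和,国国,国歌”, exhausting every possible combination.
ik_smart: splits text at the coarsest granularity. For example, “中华人民共和国国歌” is split into “中华人民共和国,国歌”.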
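
The difference between the two analyzers can be inspected with Elasticsearch's `_analyze` API. The following is a minimal sketch that only builds the request bodies; it assumes the IK plugin is installed on the cluster, and the helper name `build_analyze_request` is hypothetical. The bodies would be sent to `GET /_analyze` with whatever HTTP client is in use.

```python
import json


def build_analyze_request(analyzer: str, text: str) -> dict:
    """Build the JSON body for a GET /_analyze request (hypothetical helper)."""
    return {"analyzer": analyzer, "text": text}


sample = "中华人民共和国国歌"

# Same text, two analyzers: compare the token lists returned by the cluster.
fine_grained = build_analyze_request("ik_max_word", sample)
coarse = build_analyze_request("ik_smart", sample)

# ensure_ascii=False keeps the Chinese text readable in the serialized body.
fine_grained_body = json.dumps(fine_grained, ensure_ascii=False)
coarse_body = json.dumps(coarse, ensure_ascii=False)
```

Comparing the two responses for the same text is a quick way to confirm that the analyzers behave as described before committing to a mapping.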

The field mapping is set as follows:

   "name":  {
        "type":  "text",
        "analyzer":  "ik_max_word",
        "search_analyzer": "ik_smart"
      },

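"image" is a link that never needs to be searched, only displayed. It does not need to be indexed, which saves memory and disk space, and it is not used for aggregations either, so it can simply be set to "enabled": false. Other fields with the same requirements can be handled the same way.
"shop_name" is the shop name; it can be analyzed the same way as "name".

"coupon_pass" is the coupon promotion code. It does not need to be analyzed but must be indexed, so it is mapped as "keyword".
The corresponding data model: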

PUT item_index
{
 "mappings":  {
   "dynamic": false,
   "properties":  {
     "id":  {
       "type":  "long"
     },
     "name":  {
       "type":  "text",
       "analyzer":  "ik_max_word",
       "search_analyzer": "ik_smart"
     },
     "image":  {
       "enabled": false
     },
     "item_url":  {
       "enabled": false
     },
     "shop_name":  {
       "type":  "text",
       "analyzer":  "ik_max_word",
       "search_analyzer": "ik_smart",
       "fields": {
           "keyword": {
               "type":  "keyword"
            }
        }
     },
     "price":  {
       "type":  "double"
     },
     "sales":  {
       "type":  "integer"
     },
     "contact_info":  {
       "type":  "keyword"
     },
     "short_url":  {
       "enabled": false
     },
     "sales_url":  {
        "enabled": false
     },
     "sales_pass":  {
       "type":  "keyword"
     },
     "coupon_total_num":  {
       "type":  "integer"
     },
     "coupon_remaining_num":  {
       "type":  "integer"
     },
     "coupon_quota":  {
       "type":  "keyword"
     },
     "coupon_start_date":  {
       "type":  "date",
       "format":  "yyyy-MM-dd"
     },
     "coupon_end_date":  {
       "type":  "date",
       "format":  "yyyy-MM-dd"
     },
     "coupon_url":  {
       "enabled": false
     },
     "coupon_pass":  {
       "type":  "keyword"
     },
     "coupon_short_url":  {
       "enabled": false
     }
   }
 }
}
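
To show how this mapping is used at query time, here is a minimal sketch of a search body for `item_index`. The helper name `build_item_search` and the sample keyword and date are assumptions for illustration; the body would be sent to `GET item_index/_search`. The `match` query on "name" is analyzed with the field's `search_analyzer` (`ik_smart`), while the date and numeric fields are used for filtering and sorting.

```python
def build_item_search(keyword: str, on_date: str, size: int = 10) -> dict:
    """Build a search body for item_index (hypothetical helper).

    - "match" on "name" goes through the field's search_analyzer (ik_smart)
    - the two range filters keep items whose coupon is valid on `on_date`
    - hits are sorted by "sales", highest first
    """
    return {
        "size": size,
        "query": {
            "bool": {
                "must": [{"match": {"name": keyword}}],
                "filter": [
                    {"range": {"coupon_start_date": {"lte": on_date}}},
                    {"range": {"coupon_end_date": {"gte": on_date}}},
                ],
            }
        },
        "sort": [{"sales": {"order": "desc"}}],
    }


# Dates follow the "yyyy-MM-dd" format declared in the mapping.
body = build_item_search("打底裤", "2019-09-22")
```

Fields mapped with "enabled": false, such as "image", cannot appear in the query or sort clauses, but they are still returned in `_source` for display.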

Example 2: Server log data structure

222.67.85.228 - - [14/Nov/2018:14:30:34 +0800] "GET /search?keyword=&hasCoupon=0&pageNum=1&pageSize=100 HTTP/1.1" 200 12268 "-" "Apache-HttpClient/4.5.5 (Java/1.8.0_131)" "-"

After log formatting, the nginx log line is converted into the following data structure:

{
    "ip": "222.67.85.228",
    "username": "-",
    "time": "2018-11-14 14:30:34",
    "request_action": "GET",
    "request_url": "/search?keyword=&hasCoupon=0&pageNum=1&pageSize=100",
    "http_version": "1.1",
    "response_status": 200,
    "bytes": 12268,
    "referrer": "-",
    "agent": "Apache-HttpClient/4.5.5 (Java/1.8.0_131)",
    "http_forward": "-"
}
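
The original does not say how the log line is formatted. One possible approach is a regular expression, sketched below under that assumption. The pattern and the function name `parse_log_line` are illustrative rather than the author's tooling, and the size field is emitted as "bytes" to match the field name in the mapping.

```python
import re
from datetime import datetime

# Pattern for the access-log line shown above; the group names mirror the
# keys of the target JSON document.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ (?P<username>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request_action>\S+) (?P<request_url>\S+) HTTP/(?P<http_version>[^"]+)" '
    r'(?P<response_status>\d{3}) (?P<bytes>\d+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)" "(?P<http_forward>[^"]*)"'
)

SAMPLE_LINE = (
    '222.67.85.228 - - [14/Nov/2018:14:30:34 +0800] '
    '"GET /search?keyword=&hasCoupon=0&pageNum=1&pageSize=100 HTTP/1.1" '
    '200 12268 "-" "Apache-HttpClient/4.5.5 (Java/1.8.0_131)" "-"'
)


def parse_log_line(line: str) -> dict:
    """Turn one access-log line into a document ready for indexing."""
    match = LOG_PATTERN.match(line)
    if match is None:
        raise ValueError(f"unrecognized log line: {line!r}")
    doc = match.groupdict()
    # Reformat the timestamp to "yyyy-MM-dd HH:mm:ss"; the UTC offset is
    # parsed but dropped, so the value stays in the server's local time.
    parsed = datetime.strptime(doc["time"], "%d/%b/%Y:%H:%M:%S %z")
    doc["time"] = parsed.strftime("%Y-%m-%d %H:%M:%S")
    # Numeric fields are converted so they match the integer/long mapping.
    doc["response_status"] = int(doc["response_status"])
    doc["bytes"] = int(doc["bytes"])
    return doc


doc = parse_log_line(SAMPLE_LINE)
```

In practice this step is often done by a log shipper rather than hand-written code; the sketch only makes the field-by-field conversion explicit.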

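Logs are usually queried along two dimensions: time and response status. For example, find all requests with response status 500 from January 1, 2019 until now. Almost none of the log fields need to be analyzed, since most are only displayed, so string fields are generally mapped as "keyword"; for date fields, take care to specify the format.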

PUT nginx_log_index
{
    "mappings": {
        "dynamic": false,
        "properties":  {
            "ip":  {
                "type": "keyword"
            },
            "username":  {
                "type": "keyword"
            },
            "time":  {
                "type": "date",
                "format": "yyyy-MM-dd HH:mm:ss"
            },
            "request_action":  {
                "type": "keyword"
            },
            "request_url":  {
                "enabled": false
            },
            "http_version":  {
                "type": "keyword"
            },
            "response_status":  {
                "type": "integer"
            },
            "bytes":  {
                "type": "long"
            },
            "referrer":  {
                "type": "keyword"
            },
            "agent":  {
                "type": "keyword"
            },
            "http_forward":  {
                "type": "keyword"
            }
        }
    }
}
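
The query described above (status 500 since January 1, 2019) can be sketched as follows. The helper name `build_error_log_query` is hypothetical, and the body is meant for `GET nginx_log_index/_search`. Both conditions go into the `filter` context because no relevance scoring is needed.

```python
def build_error_log_query(status: int, since: str) -> dict:
    """Build a search body for nginx_log_index (hypothetical helper).

    `since` must use the "yyyy-MM-dd HH:mm:ss" format declared for the
    "time" field in the mapping.
    """
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"response_status": status}},
                    {"range": {"time": {"gte": since}}},
                ]
            }
        },
        # Newest entries first.
        "sort": [{"time": {"order": "desc"}}],
    }


body = build_error_log_query(500, "2019-01-01 00:00:00")
```

Note that "request_url" is mapped with "enabled": false, so it is returned with each hit but cannot be used as a query condition.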

Example 3: Blog data structure

{
    "id": "89546eff3cd0",
    "url": "https://www.jianshu.com/p/89546eff3cd0",
    "title": "简单剖析代理模式实现原理",
    "author": "梦想实现家_Z",
    "content": "代理模式在java中随处可见，其他编程语言也一样，它的作用就是用来解耦的。代理模式又分为静态代理和动态代理。......省略剩下的内容",
    "time": "2019.04.10 21:08:21",
    "word_num": 1056,
    "read_num": 161,
    "like_num": 1,
    "reward_num": 0
}

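Because blog content is very large, and to avoid carrying it along in every query result, it is recommended to store the fields separately and retrieve only what is needed at query time. So "_source" is set to "enabled": false, and every field that must be returned is individually set to "store": true.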

PUT blog_index
{
    "mappings": {
        "dynamic": false,
        "_source": {
            "enabled": false
        }, 
        "properties":  {
            "id": {
                "type":  "keyword",
                "store":  true
            },
            "url": {
                "type":  "keyword",
                "store":  true,
                "ignore_above":  100,
                "doc_values":  false,
                "norms":  false
            },
            "title": {
                "type":  "text",
                "store":  true,
                "analyzer":  "ik_max_word",
                "search_analyzer": "ik_smart",
                "fields": {
                    "keyword": {
                        "type":  "keyword"
                    }
                }
            },
            "author": {
                "type":  "keyword",
                "store":  true
            },
            "content": {
                "type":  "text",
                "analyzer":  "ik_max_word",
                "search_analyzer": "ik_smart",
                "store":  true
            },
            "time": {
                "type":  "date",
                "format":  "yyyy.MM.dd HH:mm:ss",
                "store":  true
            },
            "word_num": {
                "type":  "integer",
                "store":  true
            },
            "read_num": {
                "type":  "integer",
                "store":  true
            },
            "like_num": {
                "type":  "integer",
                "store":  true
            },
            "reward_num": {
                "type":  "integer",
                "store":  true
            }
        }
    }
}
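
With "_source" disabled, hits carry no document body, so the fields to return have to be requested explicitly through `stored_fields`. A minimal sketch of such a search body follows; the helper name `build_blog_search` and the sample keyword are assumptions for illustration, and the body is meant for `GET blog_index/_search`.

```python
def build_blog_search(keyword: str, fields: list) -> dict:
    """Build a search body for blog_index (hypothetical helper).

    Only fields mapped with "store": true can be listed in `fields`.
    """
    return {
        "query": {
            "multi_match": {
                "query": keyword,
                "fields": ["title", "content"],
            }
        },
        # Leave "content" out of the list to keep the response small.
        "stored_fields": fields,
    }


body = build_blog_search("代理模式", ["id", "url", "title", "author", "time"])
```

Each hit then returns the requested values under its "fields" key, with every value wrapped in an array. The large "content" field is fetched only when a caller explicitly asks for it.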

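One more note: "_source" is enabled by default. When a field is very large, keeping its content in "_source" only inflates the index, and the effect grows with the number of documents: saving a few KB per document adds up to a considerable amount at the scale of hundreds of millions of documents. The blog content here is such a case. Keep in mind that disabling "_source" also makes the update and reindex APIs unavailable for the index.
For how to use "_source", see: https://blog.csdn.net/napoay/article/details/62233031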
